Lecture 16 - Radial Basis Functions
By Caltech
Summary
Topics Covered
- Kernels enable SVM in infinite-dimensional spaces
- The soft-margin parameter C trades off errors against generalization
- RBFs exert a local, distance-based influence
- K-means clustering selects the RBF centers
- RBFs originate from smoothness regularization
Full Transcript
ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we talked about kernel methods, which is a generalization of the basic SVM algorithm to accommodate feature spaces Z, which are possibly infinite and which we don't have to explicitly know, or transform our inputs to, in order to be able to carry out the support vector machinery. And the idea was to define a kernel that captures the inner product in that space. And if you can compute that kernel, the generalized inner product for the Z space, this is the only operation you need in order to carry out the algorithm, and in order to interpret the solution after you get it.
And we took an example, which is the RBF kernel, suitable since we are going to talk about RBF's, radial basis functions, today. And the kernel is very simple to compute in terms of x. It's not that difficult. However, it corresponds to an infinite-dimensional space, Z space. And therefore, by doing this, it's as if we transform every point in this space, which is two-dimensional, into an infinite-dimensional space, carry out the SVM there, and then interpret the solution back here. And this would be the separating surface that corresponds to a plane, so to speak, in that infinite-dimensional space.
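As a concrete sketch of what "simple to compute" means here, this is the Gaussian RBF kernel exp(-γ‖x − x′‖²) in plain Python. The function name, the sample points, and the default γ are illustrative choices, not from the lecture.

```python
import math

def rbf_kernel(x, x_prime, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||x - x'||^2).

    Computed directly in the input space, even though it corresponds
    to an inner product in an infinite-dimensional Z space.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

# Identical points have kernel value 1; distant points approach 0.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))   # 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))   # exp(-25), essentially 0
```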
So with this, we went into another way to generalize SVM, not by having a nonlinear transform in this case, but by having an allowance for errors. Errors in this case would be violations of the margin. The margin is the currency we use in SVM. And we added a term to the objective function that allows us to violate the margin for different points, according to the variable xi. And we have a total violation, which is this summation. And then we have a degree, C, to which we allow those violations. If C is huge, then we don't really allow the violations. And if C goes to infinity, we are back to the hard-margin case. And if C is very small, then we are more tolerant and would allow violations. And in that case, we might allow some violations here and there, and then have a smaller w, which means a bigger margin, a bigger yellow region that is violated by those guys.

Think of it as giving us another degree of freedom in our design. And it might be the case that in some data sets, there are a couple of outliers where it doesn't make sense to shrink the margin just to accommodate them, or to go to a higher-dimensional space with a nonlinear transformation to go around that point and, therefore, generate so many support vectors.
And therefore, it might be a good idea to ignore them, and ignoring them means that we are going to commit a violation of the margin. It could be an outright error. It could be just a violation of the margin, where the point is inside the margin but hasn't crossed the boundary, so to speak. And therefore, this gives us another way of achieving better generalization: by allowing some in-sample error, or margin error in this case, at the benefit of getting better generalization prospects.

Now the good news here is that, in spite of this significant modification of the statement of the problem, the solution was identical to what we had before. We are applying quadratic programming, with the same objective, the same equality constraint, and almost the same inequality constraint. The only difference is that it used to be that alpha_n could be as big as it wants. Now it is limited by C. And when you pass this to quadratic programming, you will get your solution.
Now C is a parameter, and it is not clear how to choose it. There is a compromise that I just described. The best way to pick C, and it is the way used in practice, is to use cross-validation to choose it. So you apply different values of C, you run this and see what the out-of-sample error estimate is, using your cross-validation, and then pick the C that minimizes that. And that is the way you will choose the parameter C.
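A minimal sketch of that selection procedure, assuming you already have a way to estimate the out-of-sample error for each candidate C. The candidate values and error estimates below are hypothetical placeholders, not results from an actual SVM run.

```python
def pick_C(candidates, cv_error):
    """Return the C that minimizes the cross-validation error estimate.

    cv_error maps a value of C to an out-of-sample error estimate
    (in practice, the average validation error of the soft-margin
    SVM trained with that C).
    """
    return min(candidates, key=cv_error)

# Hypothetical error estimates standing in for real cross-validation runs.
estimates = {0.01: 0.21, 0.1: 0.12, 1.0: 0.08, 10.0: 0.11, 100.0: 0.15}
best_C = pick_C(list(estimates), estimates.get)
print(best_C)  # 1.0
```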
So that ends the basic part of SVM: the hard margin, the soft margin, and the nonlinear transforms together with the kernel version of them. Together they are a technique that is really superb for classification. And it is, by the choice of many people, the model of choice when it comes to classification. Very small overhead. There is a particular criterion that makes it better than just choosing a random separating plane. And therefore, it does reflect on the out-of-sample performance.
Today's topic is a new model, which is radial basis functions. Not so new, because we had a version of it under SVM and we'll be able to relate to it. But it's an interesting model in its own right. It captures a particular understanding of the input space that we will talk about. But the most important aspect that the radial basis functions provide for us is the fact that they relate to so many facets of machine learning that we have already touched on, and other aspects that we didn't touch on in pattern recognition, that it's worthwhile to understand the model and see how it relates. It almost serves as a glue between so many different topics in machine learning. And this is one of the important aspects of studying the subject.
So the outline here-- it's not like I'm going to go through one item then the next according to this outline. What I'm going to do-- I'm going to define the model, define the algorithms, and so on, as I would describe any model. In the course of doing that, I will be able, at different stages, to relate RBF to, in the first case, nearest neighbors, which is a standard model in pattern recognition. We will be able to relate it to neural networks, which we have already studied, and to kernel methods-- obviously, it should relate to the RBF kernel. And it will. And finally, it will relate to regularization, which is actually the origin, in function approximation, for the study of RBF's.
So let's first describe the basic radial basis function model. The idea here is that every point in your data set will influence the value of the hypothesis at every point x. Well, that's nothing new. That's what happens when you are doing machine learning. You learn from the data. And you choose a hypothesis. So obviously, that hypothesis will be affected by the data. But here, it's affected in a particular way. It's affected through the distance. So a point in the data set will affect the nearby points more than it affects the far-away points. That is the key component that makes it a radial basis function.

Let's look at a picture here. Imagine that the center of this bump happens to be the data point. So this is x_n. And this shows you the influence of x_n on the neighboring points in the space. So it's most influential nearby. And then the influence goes by and dies. And the fact that this is symmetric around the center means that it's a function only of the distance, which is the condition we have here. So let me give you, concretely, the standard form of a radial basis function model.
It starts from h of x being-- and here are the components that build it. As promised, it depends on the distance. And it depends on the distance such that the closer you are to x_n, the bigger the influence is, as seen in this picture. So you take the norm of x minus x_n, squared. And you take minus gamma-- γ is a positive parameter, fixed for the moment-- and you will see that this exponential really reflects that picture. The further away you are, you go down. And you go down as a Gaussian. So this is the contribution to the point x, at which we are evaluating the function, according to the data point x_n from the data set.

Now we get an influence from every point in the data set. And those influences will have a parameter that reflects the value, as we will see in a moment, of the target here. So it will be affected by y_n. That's the influence-- it's having the value y_n propagate. So I'm not going to put it as y_n here. I'm just going to put it generically as a weight to be determined. And we'll find that it's very much correlated with y_n. And then we will sum up all of these influences, from all the data points, and you have your model. So this is the standard model for radial basis functions.

Now let me, in terms of this slide, describe why it is called radial basis function. It's radial because of the dependence on the distance ||x - x_n||. And it's a basis function because of the exponential: that is your building block. You could use another basis function. So you could have another shape that is also symmetric about the center, and has its influence in a different way. And we will see an example later on. But this is basically the model in its simplest form, and its most popular form. Most people will use a Gaussian like this. And this will be the functional form for the hypothesis.
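The standard model just described, h(x) = Σ_n w_n exp(−γ‖x − x_n‖²), can be sketched as follows. The centers, weights, and γ below are illustrative placeholders, since the weights have not been learned yet.

```python
import math

def rbf_hypothesis(x, centers, weights, gamma):
    """Standard RBF model: h(x) = sum_n w_n * exp(-gamma * ||x - x_n||^2)."""
    total = 0.0
    for x_n, w_n in zip(centers, weights):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_n))
        total += w_n * math.exp(-gamma * sq_dist)
    return total

# Two data points acting as centers, with arbitrary placeholder weights.
centers = [[0.0], [2.0]]
weights = [1.0, -1.0]
print(rbf_hypothesis([0.0], centers, weights, gamma=1.0))
```

Note how each term is largest when x sits on its center and decays as a Gaussian with the squared distance, exactly the bump picture above.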
Now we have the model. The next question we normally ask is: what is the learning algorithm? So what is a learning algorithm in general? You want to find the parameters. And we call the parameters w_1 up to w_N. And they have this functional form. So I put them in purple now, because they are the variables. Everything else is fixed. And we would like to find the w_n's that minimize some sort of error. We base that error on the training data, obviously. So what I'm going to do now is evaluate the hypothesis on the data points, and try to make them match the target values on those points-- try to make them match y. As I said, w_n won't be exactly y_n, but it will be affected by it. Now there is an interesting point of notation, because the points appear explicitly in the model. x_n is the n-th training input. And now I'm going to evaluate this on a training point, in order to evaluate the in-sample error.
So because of this, there will be an interesting notation. When we, let's say, ask ambitiously to have the in-sample error be 0, I want to be exactly right on the data points. I should expect that I will be able to do that. Why? Because really I have quite a number of parameters here, don't I? I have N data points. And I'm trying to learn N parameters. Notwithstanding the generalization ramifications of that statement, it should be easy to get parameters that really knock down the in-sample error to 0. So in doing that, what I'm going to do is apply this to every point x_n, and ask that the output of the hypothesis be equal to y_n. No error at all. So indeed, the in-sample error will be 0.

Let's substitute in the equation here. And this is true for all n up to N, and here is what you have. First, you realize that I changed the name of the dummy variable, the index here. I changed it from n to m. And this goes with x_m here. The reason I did that is because I'm going to evaluate this on x_n. And obviously, you shouldn't have recycling of the dummy variable as a genuine variable. So in this case, you want this quantity, which will in this case be the evaluation of h at the point x_n, to be equal to y_n. That's the condition. And you want this to be true for n equals 1 to N. Not that difficult to solve.
So let's go for the solution. These are the equations. And we ask ourselves: how many equations and how many unknowns? Well, I have N data points. I'm listing N of these equations. So indeed, I have N equations. How many unknowns do I have? Well, what are the unknowns? The unknowns are the w's. And I happen to have N unknowns. That's familiar territory. All I need to do is just solve it.

Let's put it in matrix form, which will make it easy. Here is the matrix form, with all the coefficients for n and m. You can see that the first index goes from 1 to N, and the second index goes from 1 to N. These are the coefficients. You multiply this by a vector of w's. So I'm putting all the N equations at once, in matrix form. And I'm asking this to be equal to the vector of y's. Let's give the matrices names. This matrix I'm going to call Φ. And I am recycling the notation Φ. Φ used to be the nonlinear transformation. And this is indeed a nonlinear transformation of sorts. There is a slight difference that we'll discuss. But we can call it Φ. And then these will be called by the standard names, the vector w and the vector y.

What is the solution for this? All you ask for, in order to guarantee a solution, is that Φ be invertible. Under that condition, the solution is very simply: w equals the inverse of Φ times y. In that case, you interpret your solution as exact interpolation, because what you are really doing is, on the points where you know the value, which are the training points, you are getting the value exactly. That's what you solved for. And now the kernel, which is the Gaussian in this case-- what it does is interpolate between the points to give you the value on the other x's. And it's exact, because you get it exactly right on those points.
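A small sketch of this exact-interpolation solution: build Φ with entries exp(−γ‖x_n − x_m‖²) and solve Φw = y, here with a plain Gaussian-elimination helper so the example stays self-contained. The three training points are made up for illustration.

```python
import math

def gaussian(x, c, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, c)))

def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_exact_rbf(X, y, gamma):
    """Build Phi[n][m] = exp(-gamma * ||x_n - x_m||^2) and solve Phi w = y."""
    Phi = [[gaussian(x_n, x_m, gamma) for x_m in X] for x_n in X]
    return solve_linear(Phi, y)

X = [[0.0], [1.0], [2.0]]
y = [1.0, -1.0, 1.0]
w = fit_exact_rbf(X, y, gamma=1.0)

# Exact interpolation: h(x_n) reproduces y_n on every training point.
residuals = [abs(sum(w_m * gaussian(x_n, x_m, 1.0) for w_m, x_m in zip(w, X)) - y_n)
             for x_n, y_n in zip(X, y)]
print(max(residuals))  # effectively 0
```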
Now let's look at the effect of γ. There was a γ, a parameter, that I considered fixed from the very beginning. This guy-- I'm highlighting it in red. When I give you a value of γ, you carry out the machinery that I just described. But you suspect that γ will affect the outcome. And indeed, it will. Let's look at two situations. Let's say that γ is small. What happens when γ is small? What happens is that this Gaussian is wide. If γ were large, then the Gaussian would be narrow. Now, depending obviously on where the points are and how sparse they are, it makes a big difference whether you are interpolating with something wide or something narrow.

And it's reflected in this picture. Let's say you take this case. And I have three points just for illustration. The total contribution of the three interpolations passes exactly through the points, because this is what I solved for. That's what I insisted on. But the small gray curves here are the contributions according to each of them. So these would be w_1, w_2, w_3 if these are the points. And when you add w_1 times the Gaussian, plus w_2 times the Gaussian, et cetera, you get a curve that gives you exactly y_1, y_2, and y_3. Now, because of the width, there is an interpolation here that is successful. Between two points, you can see that there is a meaningful interpolation.

If you go for a large γ, this is what you get. Now the Gaussians are still there. You may see them faintly. But they die out very quickly. And therefore, in spite of the fact that you are still satisfying your equations, because that's what you solved for, the interpolation here is very poor, because the influence of this point dies out, and the influence of that point dies out. So in between, you just get nothing. So clearly, γ matters. And you probably, in your mind, think that γ matters also in relation to the distance between the points, because that's what the interpolation is. And we will discuss the choice of γ towards the end. After we settle all the other parameters, we will go and visit γ and see how we can choose it wisely.
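The wide-versus-narrow contrast can be checked numerically on a made-up two-point example: fit exactly through (0, 1) and (2, 1), then look at the interpolated value at the midpoint x = 1. A small γ gives a sensible value near 1; a large γ gives almost nothing. The point positions and γ values are illustrative assumptions.

```python
import math

def interpolate_two_points(gamma):
    """Fit h(x) = w1*exp(-gamma*(x-0)^2) + w2*exp(-gamma*(x-2)^2)
    exactly through (0, 1) and (2, 1), then evaluate h at x = 1."""
    k = math.exp(-gamma * 4.0)  # cross term exp(-gamma * ||x_1 - x_2||^2)
    # Phi = [[1, k], [k, 1]], y = [1, 1]  =>  w1 = w2 = 1 / (1 + k)
    w = 1.0 / (1.0 + k)
    return 2.0 * w * math.exp(-gamma * 1.0)  # h(1)

print(interpolate_two_points(0.1))   # ~1.08: wide Gaussians interpolate well
print(interpolate_two_points(10.0))  # ~9e-05: influence dies out in between
```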
With this in mind, we have a model. But that model, if you look at it, is a regression model. I consider the output to be real-valued. And I match the real-valued output to the target output, which is also real-valued. Often, we will use RBF's for classification. When you look at h of x, which used to be regression, it gives you a real number. Now we are going to take, as usual, the sign of this quantity, +1 or -1, and interpret the output as a yes/no decision.

And we would like to ask ourselves: how do we learn the w's under these conditions? That shouldn't be a very alien situation to you, because you have seen before linear regression used for classification. That is pretty much what we are going to do here. We are going to focus on the inner part, which is the signal before we take the sign. And we are going to try to make the signal itself match the +1,-1 target, like we did when we used linear regression for classification. And after we are done, since we are trying hard to make it +1 or -1, if we are successful-- we get an exact solution-- then obviously the sign of it will be +1 or -1. If you are not successful, and there is an error, as will happen in other cases, then at least since you try to make one close to +1, and you try to make the other one close to -1, you would think that the sign, at least, will agree with +1 or -1.

So the signal here is what used to be the whole hypothesis value. And what you're trying to do is minimize the mean squared error between that signal and y, knowing that y-- on the training set-- is only +1 or -1. So you solve for that. And then when you get s, you report the sign of that s as your value. So we are ready to use the solution we had before in case we are using RBF's for classification.
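A minimal sketch of this classification variant, taking the sign of the RBF signal. The weights here are set by hand for illustration rather than fit to the ±1 targets.

```python
import math

def rbf_signal(x, centers, weights, gamma):
    """The real-valued signal before taking the sign."""
    return sum(w * math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, c)))
               for w, c in zip(weights, centers))

def rbf_classify(x, centers, weights, gamma):
    """Classification: the sign of the signal, interpreted as +1 / -1."""
    return 1 if rbf_signal(x, centers, weights, gamma) >= 0 else -1

# Hand-picked weights for illustration (in practice, fit to the +/-1 targets).
centers = [[0.0], [2.0]]
weights = [1.5, -1.5]
print(rbf_classify([0.2], centers, weights, gamma=1.0))   # +1: near the + center
print(rbf_classify([1.8], centers, weights, gamma=1.0))   # -1: near the - center
```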
Now we come to the observation that the radial basis functions are related to other models. And I'm going to start with a model that we didn't cover. It's extremely simple to cover in five minutes. And it shows an aspect of radial basis functions that is important. This is the nearest-neighbor method. So let's look at it.

The idea of nearest neighbor is that I give you a data set. And each data point has a value y_n. It could be a label, if you're talking about classification, or it could be a real value. And what you do for classifying other points, or assigning values to other points, is very simple. You look at the closest point within that training set to the point that you are considering. So you have x. You look at which x_n in the training set is closest to x, in Euclidean distance. And then you inherit the label, or the value, that that point has. Very simplistic.

Here is a case of classification. The data set are the red pluses and the blue circles. And what I am doing is applying this rule of classifying every point on this plane, which is X, the input space, according to the label of the nearest point within the training set. As you can see, if I take a point here, this is the closest. That's why this region is pink. And here it's still the closest. Once I'm here, this guy becomes the closest. And therefore, it gets blue. So you end up, as a result of that, as if you are breaking the plane into cells. Each of them has the label of the point in the training set that happens to be in the cell. And this tessellation of the plane, into these cells, describes the boundary for your decisions. This is the nearest-neighbor method.
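The nearest-neighbor rule can be sketched in a few lines. The toy training set below is an assumption standing in for the pluses and circles in the figure.

```python
def nearest_neighbor(x, X, y):
    """Inherit the label of the closest training point (Euclidean distance)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(X)), key=lambda n: sq_dist(x, X[n]))
    return y[best]

# Toy training set: pluses near the origin, circles near (3, 3).
X = [[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]]
y = [+1, +1, -1, -1]
print(nearest_neighbor([0.2, 0.1], X, y))   # 1
print(nearest_neighbor([3.2, 2.8], X, y))   # -1
```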
Now, if you want to implement this using radial basis functions, there is a way to implement it. It's not exactly this, but it has a similar effect, where you basically are trying to take the influence of a nearby point. And that is the only thing you're considering. You are not considering other points. So let's say you take the basis function, in this case, to look like this. Instead of a Gaussian, it's a cylinder. It's still symmetric-- it depends on the radius. But the dependency is very simple. It is constant, and then it goes straight to 0. So it's very abrupt. In that case, I am not getting exactly nearest neighbor. But what I'm getting is a cylinder around every one of those points that inherits the value of that point. And obviously, there is the question of overlaps and whatnot, and that is what makes a difference from nearest neighbor.

In both of those cases, it's fairly brittle. You go from here to here, and you immediately change values. And if there are points in between, you keep changing from blue, to red, to blue, and so on. In this case, it's even more brittle, and so on. So in order to make it less abrupt, nearest neighbor is modified to become K nearest neighbors. That is, instead of taking the value of the closest point, you look for, let's say, the three closest points, or the five closest points, or the seven closest points, and then take a vote. If most of them are +1, you consider yourself +1. That helps even things out a little bit. So an isolated point in the middle, that doesn't belong, gets filtered out by this. This is a standard way of smoothing, so to speak, the surface here. It will still be very abrupt going from one point to another, but at least the number of fluctuations will go down.
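The K-nearest-neighbors vote, sketched on a made-up one-dimensional set where a lone −1 sits among +1 points: with k = 3 the vote filters it out, while plain nearest neighbor (k = 1) would not.

```python
def knn_classify(x, X, y, k=3):
    """Vote among the k closest training points (labels are +/-1)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    order = sorted(range(len(X)), key=lambda n: sq_dist(x, X[n]))
    vote = sum(y[n] for n in order[:k])
    return 1 if vote >= 0 else -1

# An isolated -1 among +1 points gets out-voted by its neighbors.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [+1, -1, +1, +1]
print(knn_classify([1.1], X, y, k=3))   # 1: the lone -1 is filtered out
print(knn_classify([1.1], X, y, k=1))   # -1: plain nearest neighbor keeps it
```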
The way you smoothen the radial basis function is, instead of using a cylinder, you use a Gaussian. So now it's not like I have an influence, I have an influence, I have an influence, and then suddenly I don't have any influence. Now you have an influence, you have less influence, you have even less influence, and eventually you have effectively no influence, because the Gaussian went to 0. And in both of those cases, you can consider the model-- whether it's nearest neighbor or K nearest neighbors, or a radial basis function with different bases-- as a similarity-based method. You are classifying points according to how similar they are to points in the training set. And the particular form of applying the similarity is what defines the algorithm, whether it's this way or that way, whether it's abrupt or smooth, and whatnot.
Now let's look at the model we had, which is the exact-interpolation model, and modify it a little bit in order to deal with a problem that you probably already noticed, which is the following. In the model, we have N parameters, w_1 up to w_N, and it is based on N data points. I have N parameters; I have N data points. That sets off alarm bells, because by now you relate generalization to the ratio between data points and parameters, the number of parameters being more or less a VC dimension. And therefore, in this case, it's pretty hopeless to generalize. It's not as hopeless as in other cases, because the Gaussian is a pretty friendly guy. Nonetheless, you might consider the idea: I'm going to use radial basis functions, so I'm going to have an influence, symmetric and all of that, but I don't want every point to have its own influence. What I'm going to do is elect a number of important centers for the data, have these as my centers, and have them influence the neighborhood around them. So what you do is take K, which is the number of centers, and hopefully K is much smaller than N, so that the generalization worry is mitigated.
Then you define the centers -- these are vectors, μ_1 up to μ_K -- as the centers of the radial basis functions, instead of having x_1 up to x_N, the data points themselves, be the centers. Now those guys live in the same space, let's say a d-dimensional Euclidean space. They are exactly in the same space, except that they are not necessarily data points. We may have elected some of them as being important points, or we may have elected points that are simply representative and don't coincide with any of the data points. Generically, there will be μ_1 up to μ_K.

In that case, the functional form of the radial basis function changes, and it becomes this. Let's look at it. It used to be that we were summing from 1 to N; now it's from 1 to K. And we have w. So indeed, we have fewer parameters. And now we are comparing the x that we are evaluating at, not with every point, but with every center. According to the distance from that center, the influence of that particular center, which is captured by w_k, is contributed. You add up the contributions of all the centers, and you get the value. Exactly the same thing we did before, except, with this modification, that we are using centers instead of points.
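The modified hypothesis sums the contribution of each center, weighted by a Gaussian of the distance to it. A minimal sketch, where the width parameter γ is an assumption (the lecture's slides fix it separately):

```python
import numpy as np

def rbf_hypothesis(x, centers, w, gamma=1.0):
    """h(x) = sum over k of w_k * exp(-gamma * ||x - mu_k||^2), K centers instead of N points."""
    dists_sq = np.sum((centers - x) ** 2, axis=1)  # squared distance to each center mu_k
    return np.dot(w, np.exp(-gamma * dists_sq))    # Gaussian-weighted sum of contributions

# Each center contributes according to how close x is to it:
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
w = np.array([1.0, -1.0])
print(rbf_hypothesis(np.array([0.0, 0.0]), centers, w))  # close to 1: the nearby center dominates
```

Evaluating at a point far from every center gives a value near 0, which is the "effectively no influence" behavior the Gaussian provides.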
So the parameters here are now interesting. I have the w_k's as parameters. And I'm supposedly going through this entire exercise because I didn't like having N parameters; I want only K parameters. But look what we did. The μ_k's are now parameters too, right? I don't know what they are, and I have K of them. That by itself is not a worry, because I already said that K is much smaller than N. But each of them is a d-dimensional vector, isn't it? So that's a lot of parameters. If I have to estimate those as well, I haven't made a lot of progress in this exercise. But it turns out that I will be able, through a very simple algorithm, to estimate those without touching the outputs of the training set, so without contaminating the data. That's the key.

Two questions, then. How do I choose the centers? That is an interesting question, because -- if I want to maintain that the number of parameters here is small -- I have to choose them without really consulting the y_n's, the values of the output on the training set. And the other question is how to choose the weights. Choosing the weights shouldn't be that different from what we did before; it will be a minor modification, because it has the same functional form. The first question is the interesting part, at least the novel part.
So let's talk about choosing the centers. What we are going to do is choose the centers as representatives of the data inputs. I have N points -- they are here, here, and here. And the whole idea is that I don't want to assign a radial basis function to each of them. Therefore, I'm going to have representatives. It would be nice, for every group of points that are nearby, to have a center near them, so that it captures that cluster. This is the idea. So you are going to take x_n, take the center which is closest to it, and assign that point to that center.

Here is the idea. I have the points spread around. I am going to select centers. It's not yet clear how I choose the centers. But once you choose them, I'm going to consider the neighborhood of the center within the data set -- the x_n's -- as being the cluster that has that center. If I do that, then those points are represented by that center, and therefore I can say that their influence will be propagated through the entire space by the radial basis function that is centered around it.

So let's do this. It's called K-means clustering, because the center for the points will end up being the mean of the points, as we'll see in a moment. Here is the formalization. You split the data points, x_1 up to x_N, into groups -- clusters, so to speak -- hopefully points that are close to each other. You call these S_1 up to S_K. Each cluster will have a center that goes with it. And what you minimize, in order to make this a good clustering with good representative centers, is the distance from the points to their centers. So you take this for every point you have, but you sum only over the points in the cluster. You take the points in the cluster whose center is μ_k, and you try to minimize the mean squared error there, mean squared error in terms of Euclidean distance. That takes care of one cluster, S_k. You want this to be small over all the data, so you sum it up over all the clusters. That becomes your objective function in clustering.

Someone gives you K. That is, the choice of the actual number of clusters is a different issue. But let's say K is 9; I give you 9 clusters. Then I'm asking you to find the μ's, and the break-up of the points into the S_k's, such that this value assumes its minimum. If you succeed in that, then I can claim that this is a good clustering, and these are good representatives of the clusters.
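The objective function described above is the sum, over clusters, of squared Euclidean distances from each point to its cluster's center. A small sketch; encoding cluster membership as an index array is an assumed representation, not something fixed by the lecture:

```python
import numpy as np

def kmeans_objective(X, centers, assign):
    """Sum over k of sum over x_n in S_k of ||x_n - mu_k||^2."""
    return sum(np.sum((X[assign == k] - mu) ** 2)   # squared distances within cluster k
               for k, mu in enumerate(centers))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
centers = np.array([[1.0, 0.0], [10.0, 0.0]])
assign = np.array([0, 0, 1])   # first two points in cluster 0, last point in cluster 1
print(kmeans_objective(X, centers, assign))  # -> 2.0  (1 + 1 + 0)
```

Minimizing this jointly over both the centers μ_k and the memberships S_k is exactly the problem stated on the slide.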
Now I have some good news and some bad news. The good news is that, finally, we have unsupervised learning. I did this without any reference to the labels y_n. I am taking the inputs and producing some organization of them, which, as we discussed, is the main goal of unsupervised learning. So we are happy about that.

Now the bad news. The bad news is that the problem, as I stated it, is NP-hard in general. It's a nice unsupervised problem, but not so nice: it's intractable if you want to get the absolute minimum. So our goal now is to go around that. That sort of problem being NP-hard never discouraged us. Remember, with neural networks, we said that finding the absolute minimum of the error in the general case would also be NP-hard. And we ended up saying we would find some heuristic -- gradient descent in that case -- which led to backpropagation. We start with a random configuration and then descend. And we get, not the global minimum, the finding of which is NP-hard, but a local minimum, hopefully a decent one.
We'll do exactly the same thing here. Here is the iterative algorithm for solving this problem, K-means, and it's called Lloyd's algorithm. It is extremely simple, to the level where the contrast between this algorithm -- not only its specification, but how quickly it converges -- and the fact that finding the global minimum is NP-hard, is rather mind-boggling.

So here is the algorithm. What you do is iteratively minimize this quantity. You start with some configuration and get a better configuration. As you see, I now have two things in purple, which are my parameters here. The μ's are parameters by definition; I am trying to find what they are. But the sets S_k, the clusters, are also parameters: I want to know which points go into them. These are the two things that I'm determining. So the way this algorithm works is that it fixes one of them and tries to minimize with respect to the other. For this particular membership of the clusters, could you find the optimal centers? Now that you've found the optimal centers -- forget about the clustering that resulted in them; these are just centers -- could you find the best clustering for those centers? And you keep repeating back and forth. Let's look at the steps. You are minimizing this with respect to both, so you take one at a time.
First you update the value of μ. How do you do that? You take the fixed clustering that you have -- a clustering inherited from the last iteration -- and you take the mean of each cluster. You take the points that belong to that cluster, add them up, and divide by their number. You know that this must be pretty good at minimizing the mean squared error, because the squared error to the mean is the smallest of the squared errors to any point; the mean happens to be the closest to the points collectively, in terms of mean squared value. So if I do that, I know this is a good representative, if this were the real cluster. That's the first step.

Now I have new μ_k's. So you freeze the μ_k's, and you completely forget about the clustering you had before. Now you are creating new clusters, and the idea is the following. You take every point, and you measure the distance between it and μ_k, the newly acquired μ_k. And you ask yourself: is this the closest of the μ's that I have? So you compare it with all of the others. If it happens to be the smallest, then you declare that this x_n belongs to S_k. You do this for all the points, and you create a full clustering.

Now, if you look at the first step, we argued that it reduces the error. It has to, because you picked the mean for every cluster, and that will definitely not increase the error. The second step will also decrease the error, because the worst it can do is take a point from one cluster and put it in another. But in doing that, what did it do? It picked the center that is closest. So the term that used to be in the sum is now smaller, because the point went to the closer center. So this step reduces the value, and that step reduces the value. You go back and forth, and the quantity keeps going down.
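The two alternating steps can be sketched as follows. Initializing the centers from randomly chosen data points is an assumption of this sketch (the lecture initializes at random positions); everything else follows the update rules just described:

```python
import numpy as np

def lloyds_algorithm(X, K, n_iter=100, seed=0):
    """Alternate: S_k <- points closest to mu_k; mu_k <- mean of cluster S_k."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # assumed initialization
    for _ in range(n_iter):
        # Clustering step: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # Center step: move each center to the mean of its cluster.
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):  # converged: nothing moved
            break
        centers = new_centers
    return centers, assign

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
centers, assign = lloyds_algorithm(X, K=2)
print(sorted(centers.tolist()))  # two centers, near [0.05, 0.05] and [5.05, 5.05]
```

Both steps are non-increasing in the objective, which is why the back-and-forth converges.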
Are we ever going to converge? Yes, we have to. Because by structure, we are only dealing with a finite number of points, and there is a finite number of possible values for the μ's given the algorithm, because they have to be averages of subsets of those points. So if I have 100 points, there will be a finite, but tremendously big, number of possible values. But it's finite; that's all I care about. As long as it's finite and I'm going down, I will definitely hit a minimum. It will not be the case that it's a continuous thing, where I do half, and then half again, and half of half, and never arrive. Here, you will arrive perfectly at a point.

The catch is that you're converging to a good, old-fashioned local minimum. Depending on your initial configuration, you will end up with one local minimum or another. But again, this is exactly the same situation as we had with neural networks. We did converge to a local minimum with backpropagation, right? And that minimum depended on the initial weights. Here, it will depend on the initial centers, or the initial clustering, whichever way you want to begin. And the way you deal with it is: try different starting points, and you get different solutions. You can evaluate which one is better, because you can definitely evaluate the objective function for all of them, and pick the best out of a number of runs. That usually works very nicely. It's not going to give you the global minimum, but it's going to give you a very decent clustering, and very decent representative μ's.
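Running the algorithm from several random starts and keeping the run with the lowest objective, as just described, can be sketched like this. The pieces are written inline so the sketch is self-contained; the restart count and initialization scheme are assumptions:

```python
import numpy as np

def run_kmeans(X, K, rng, n_iter=50):
    """One run of Lloyd's algorithm; returns (centers, objective value)."""
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                            else centers[k] for k in range(K)])
    obj = sum(np.sum((X[assign == k] - mu) ** 2) for k, mu in enumerate(centers))
    return centers, obj

def best_of_restarts(X, K, n_restarts=10, seed=0):
    """Try different starting points; keep the run with the smallest objective."""
    rng = np.random.default_rng(seed)
    return min((run_kmeans(X, K, rng) for _ in range(n_restarts)),
               key=lambda pair: pair[1])

X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 4])
centers, obj = best_of_restarts(X, K=2)
print(round(obj, 6))  # -> 0.0 for this trivially separable data
```

Each restart may land in a different local minimum, but the objective is always computable, so picking the best run is cheap.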
Now, let's look at Lloyd's algorithm in action. I'm going to take the problem that I showed you last time for the RBF kernel. This is the one we're going to carry through, because we can relate to it now. Let's see how the algorithm works.

The first step in the algorithm: give me the data points. OK, thank you. Here are the data points. If you remember, this was the target; the target was slightly nonlinear. We had -1 and +1, shown in color. And that is the data we have. First thing: I only want the inputs. I don't see the labels, and I don't see the target function. You probably don't see the target function anyway -- it's so faint! But really, you don't see it at all. So I'm now going to take away the target function and the labels, and keep only the positions of the inputs. This is what you get. Looks more formidable now, right? I have no idea what the function is.

But now we realize one interesting point. I'm going to cluster those points without any benefit of the labels. So I could have clusters that belong to one category, +1 or -1. And I could, as well, have clusters that happen to be on the boundary, where half of the points are +1 and half are -1. That's the price you pay when you do unsupervised learning. You are trying to capture similarity, but the similarity is only as far as the inputs are concerned, not as far as the behavior with respect to the target function is concerned. That is key.

So I have the points. What do I do next? You need to initialize the centers. There are a number of ways of doing this; I'm going to keep it simple here and initialize the centers at random. So I'm just going to pick 9 points. And I'm picking 9 for a good reason. Remember last lecture, when we did the support vector machines, we ended up with 9 support vectors. And I want to be able to compare them, so I'm fixing the number in order to compare them head to head. Here are my initial centers -- totally random. It looks like a terribly stupid thing to have three centers near each other and have this entire area empty. But let's hope that Lloyd's algorithm will place them a little more strategically.
Now you iterate. So now I would like you to stare at this; I'll even make it bigger. Stare at it, because I'm going to do a full iteration now. I am going to do re-clustering and re-evaluation of the μ's, and then show you the new μ's, one step at a time. This is the first step. Keep your eyes on the screen. They moved a little bit. And I am pleased to find that those centers that used to be crowded are now serving different points; they are moving away from each other.

Second iteration. I have to say, this is not one iteration; these are a number of iterations, but I'm sampling at a certain rate, in order not to completely bore you. Otherwise it would be clicking through to the end of the lecture, and we would have the clustering at the end of the lecture, and nothing else! So, next iteration. Look at the screen. The movement is becoming smaller. Third iteration -- just a touch. Fourth -- nothing happened. But I actually flipped the slide. Nothing happened. Really, nothing happened. So we have converged, and these are your μ's. And it does converge very quickly.

You can see now that the centers make sense. These points have a center; these points have a center; these points, and so on. I guess since this one started here, it got stuck and is just serving two points, or something like that. But more or less, it's a reasonable clustering -- notwithstanding the fact that there was no natural clustering for the points. It's not as if I generated these points from 9 centers; they were generated uniformly. So the clustering is incidental. But nonetheless, the clustering here makes sense.
Now this is a clustering, right? Surprise! Let's go back to the labeled picture, and now look at the clustering and see what happens. This cluster takes points from both +1 and -1. They looked very similar to it, because the clustering only depended on the x's. Many of the clusters are deep inside one region and, indeed, deal with points of the same label. The reason I'm making an issue of this is because of the way a center will serve: as a center of influence affecting the value of the hypothesis. It will get a w_k, and then it will propagate that w_k according to the distance from itself. So the centers that happen to sit between positive and negative points will cause me a problem, because what do I propagate? The +1 or the -1? But indeed, that is the price you pay when you use unsupervised learning.

So this is Lloyd's algorithm in action. Now I'm going to do something interesting. We have 9 points that are centers obtained by unsupervised learning, in order to carry out the influence of radial basis functions using the algorithm we will have. That's number one. Last lecture, we also had 9 points: they were support vectors, and they were representative of the data points. Since the 9 support vectors were representative of the data points, and the 9 centers here are representative of the data points, it might be illustrative to put them next to each other, to understand what is common, what is different, where they came from, and so on.
Let's start with the RBF centers. Here they are. I put them on the labeled data -- not that I got them from the labeled data, but just to have the same picture right and left. So these are where the centers are; everybody sees them clearly. Now let me remind you of what the support vectors from last time looked like. Here are the support vectors. Very interesting, indeed. The support vectors obviously are all around the boundary here. They had no interest whatsoever in representing clusters of points; that was not their job. And here, the centers have absolutely nothing to do with the separating surface; they didn't even know that there was a separating surface. They just looked at the data.

You basically get what you set out to do. Here you were representing the data inputs, and you got a representation of the data inputs. There you were trying to capture the separating surface -- that's what support vectors do; they support the separating surface -- and that is what you got. The centers are generic: they are all black. The support vectors come with some blue and some red, because they come with a label, the value y_n. So some of them are on this side, and some of them are on that side. Indeed, they serve completely different purposes. And it's rather remarkable that we get two solutions using the same kernel, the RBF kernel, through such an incredibly different diversity of approaches. This was just to show you the difference between choosing the important points in an unsupervised way, and here, patently, in a supervised way. Choosing the support vectors was very much dependent on the value of the target.

The other thing you need to notice is that the support vectors have to be points from the data, whereas the μ's here are not points from the data. They are averages of those points, so they can end up anywhere. If you actually look, for example, at these three points: one of them became a center, and one of them became a support vector. On the other hand, this point doesn't exist in the data; it is just a center that happens to be somewhere in the plane.
So now we have the centers. I give you the data, I tell you K equals 9, you go and do your Lloyd's algorithm, and you come up with the centers -- and half the problem of the choice is now solved. And it's the big half, because the centers are vectors of d dimensions. Now I have found the centers without even touching the labels; I didn't touch the y_n's, so I know that I didn't contaminate anything. Indeed, I have only the weights -- which happen to be K weights -- left to determine using the labels. Therefore, I have good hopes for generalization.

Now I look here. I froze this part -- it became black now, because it has been chosen. Now I'm only trying to choose these, the w_k's. This is y_n. And I ask myself the same question: I want this to hold for all the data points, if I can. How many equations, how many unknowns? I end up with N equations -- same thing as before: I want this to be true for all the data points, and I have N data points, so I have N equations. How many unknowns? The unknowns are the w's, and I have K of them. And oops, K is less than N: I have more equations than unknowns. So something has to give, and the equality is what has to give. All I can hope for is to get it close, in a mean squared sense, as we have done before. I don't think you'll be surprised by anything in this slide; you have seen this before. So let's do it.
This is the matrix Φ now. It's a new Φ, with K columns and N rows. So according to our criterion that K is smaller than N, this is a tall matrix. You multiply it by w, which is the K weights, and you should get approximately y. Can you solve this? Yes, we have done this before in linear regression. All you need is to make sure that Φ transposed times Φ is invertible. Under those conditions, you have a one-step solution, which is the pseudo-inverse: you take (Φᵀ Φ)⁻¹, times Φᵀ, times y. That gives you the value of w that minimizes the mean squared difference between these quantities. So you have the pseudo-inverse instead of exact interpolation, and in this case you are not guaranteed to get the correct value at every data point. You are going to make some in-sample error. But we know that is not a bad thing. On the other hand, we are only determining K weights, so the chances of generalization are good.
Now, I would like to take this and put it as a graphical network. This will help me relate it to neural networks; this is the second link. We already related RBFs to nearest-neighbor methods, similarity methods. Now we are going to relate them to neural networks. Let me first put up the diagram. Here's my illustration of it. I have x. I compute the radial aspect, the distance from μ_1 up to μ_K, and then hand it to a nonlinearity, in this case the Gaussian nonlinearity. You can have other basis functions; we had the cylinder in one case, but the cylinder is a bit extreme. You get features that are combined with weights in order to give you the output. The final node could be just passing the sum if you're doing regression, could be a hard threshold if you're doing classification, could be something else. But what I care about is that this configuration looks familiar to us. It's layers: I extract features, and then I go to the output.

Let's look at the features. The features are these fellows, right? If you look at these features, they depend on the μ's, which in general are parameters. If I didn't have this slick Lloyd's algorithm, K-means, the unsupervised thing, I would need to determine what these parameters are. And once you determine them, the value of the feature depends on the data set. And when the value of the feature depends on the data set, all bets are off: it's no longer a linear model, pretty much like a neural network whose first layer extracts the features. Now the good thing is that, because we used only the inputs in order to compute the μ's, it's almost linear. We got the benefit of the pseudo-inverse because, in this case, we never had to go back and adjust a μ just because we didn't like the value of the output. The μ's were frozen forever based on the inputs, and then we only had to get the w's. The w's now appear as multiplicative factors, in which case the model is linear in those w's, and we get the solution.

Now, in radial basis functions there is often a bias term added. You don't only get those units; you also get w_0, or b, and it enters the final layer. So you just add another weight, this time multiplied by 1, and everything else remains the same. The Φ matrix gets another column because of this, and you just run the machinery you had before.
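Putting these pieces together, the Φ matrix with the extra bias column and the one-step pseudo-inverse fit can be sketched as follows. The centers and γ are assumed to be given already (for instance from Lloyd's algorithm); the function names are illustrative, not from the lecture.

```python
import numpy as np

def rbf_features(X, centers, gamma):
    """Build the N-by-(K+1) matrix Φ: a column of 1's for the bias
    term, then one Gaussian basis function per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-gamma * d2)                       # N x K
    return np.hstack([np.ones((len(X), 1)), Phi])   # prepend bias column

def fit_weights(X, y, centers, gamma):
    """One-step solution w = (Φᵀ Φ)⁻¹ Φᵀ y, the pseudo-inverse.
    It minimizes the mean squared error; with K < N there is generally
    some in-sample error, which is fine for generalization."""
    Phi = rbf_features(X, centers, gamma)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def predict(X, centers, gamma, w):
    return rbf_features(X, centers, gamma) @ w
```

For classification, you would take the sign of the output, as in the lecture's comparison with the SVM.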
Now let's compare it to neural networks. Here is the RBF network; we just saw it, and I marked x in red. This is what gets passed in, gets the features, and gets you the output. And here is a neural network that is comparable in structure. You start with the input, and then you compute features. The features here depend on the distance, and they are such that, when the distance is large, the influence dies. So if you look at this value, and it is huge, you know that this feature will have zero contribution. In the neural network, by contrast, this quantity, big or small, goes through a sigmoid. It could be huge, small, or negative, and it always has a contribution. So one interpretation is that radial basis function networks look at local regions in the space and worry about them, without worrying about the far-away points. I have a function in this space; I look at this part, and I want to learn it, so I get a basis function that captures it, or a couple of them, et cetera. And I know that by the time I go to another part of the space, whatever I have done here is not going to interfere, whereas in the case of neural networks it did interfere very, very much. There, the way you actually got something interesting was to make sure that the combinations of the units give you what you want. But that model is not local the way it is in this case. So that is the first observation.

The second observation is that here, the nonlinearity is called Φ, and the corresponding nonlinearity there is θ. Then you combine with the w's, and you get h. Very much the same, except that the way you extract features is different. In the neural network, the w's of the first layer were full-fledged parameters that depended on the labels; we used backpropagation to get them. So those are learned features, which makes it completely not a linear model. Here, if we had learned the μ's based on their effect on the output, which would be a pretty hairy algorithm, the same would be true. But we didn't, and therefore this model is almost linear in the part that matters. That is why we got this part fixed, and then we got the weights using the pseudo-inverse.

One last thing: this is a two-layer network, and this is a two-layer network. Pretty much any two-layer network of this type of structure lends itself to being a support vector machine. The first layer takes care of the kernel, and the second one is the linear combination that is built into support vector machines. So you can solve a support vector machine by choosing a kernel, and you can picture in your mind one of these networks, where the first part is getting the kernel and the second part is the linear part. Indeed, you can implement neural networks using support vector machines; there is a neural-network kernel for support vector machines. But it deals only with two layers, as you see here, not multiple layers as a general neural network would.
Now, the final parameter to choose here is γ, the width of the Gaussian. We now treat it as a genuine parameter; we want to learn it, and because of that it turned purple on the slide. So now the μ's are fixed, according to Lloyd. I have the parameters w_1 up to w_K, and then I also have γ. You can see this is actually pretty important because, as you saw, if we choose it wrong, the interpolation becomes very poor, and the right choice does depend on the spacing in the data set. So it might be a good idea to choose γ in order to also minimize the in-sample error, to get good performance.

Of course, I could do that, and I could do it for the w's too, for all I care; I could do it for all the parameters, because here is the value. I am minimizing mean squared error, so I compare this with the value of y_n when I plug in x_n, and I get an in-sample error, which is mean squared. I can always find parameters that minimize that using gradient descent, the most general method: start with random values, descend, and you get a solution. However, it would be a shame to do that, because the w's have such a simple algorithm that goes with them. If γ is fixed, this is a snap: you do the pseudo-inverse, and you get exactly that. So it is a good idea to treat the w's separately from γ. γ sits inside the exponential, and this and that; I don't think I have any hope of finding a shortcut, so I probably will have to do gradient descent for γ. But I might as well do gradient descent only for γ, not for the w's.

The way this is done is by an iterative approach: you fix one and solve for the others. This seems to be the theme of the lecture. In this case, there is a pretty famous algorithm, a variation of which applies here. The algorithm is called EM, Expectation-Maximization, and it is used for solving the case of mixtures of Gaussians, which we actually have, except that we are not calling them probabilities; we are calling them bases that are implementing a target. So here is the idea. Fix γ; that we have done before, we have been fixing γ all along. If you want to solve for w with γ fixed, you just solve for it using the pseudo-inverse. Now we have the w's. Now you fix them; they are frozen. And you minimize the squared error with respect to γ, a single parameter. It is pretty easy to do gradient descent with respect to one parameter. You find the minimum, you find γ, and you freeze it. Now go back to step one and find the new w's that go with the new γ. Back and forth, and it converges very, very quickly. Then you get a combination of both the w's and γ.

And because it is so simple, you might even be encouraged to ask: why do we have only one γ? I have data sets where these data points are close to each other, and one data point is far away. If I have a center here that has to reach out further, and a center there that doesn't have to reach out, it looks like a good idea to have different γ's for those centers. Granted. Since this is so simple, all you need now is K parameters γ_k, so you have doubled the number of parameters. But the number of parameters was small to begin with. Now you do the first step exactly as before: you fix the vector of γ's, and you get the w's. And now we are doing gradient descent in a K-dimensional space; we have done that before, it's not a big deal. You find the minimum with respect to those, freeze them, and go back and forth. In that case, you adjust the width of each Gaussian according to the region of the space you are in.
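The alternating scheme just described, fix γ and solve for the w's by pseudo-inverse, then fix the w's and descend on γ, can be sketched as follows. The learning rate, iteration counts, and the Gaussian feature map are assumptions for illustration; this is a sketch of the iterative idea, not the EM algorithm itself.

```python
import numpy as np

def phi(X, centers, gamma):
    """Gaussian RBF features: exp(-gamma * ||x - mu_k||^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_alternating(X, y, centers, gamma=1.0, outer=20, lr=0.05):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    for _ in range(outer):
        # Step 1: gamma frozen -> solve for w in one step (pseudo-inverse).
        w = np.linalg.pinv(np.exp(-gamma * d2)) @ y
        # Step 2: w frozen -> a few gradient steps on the single parameter gamma.
        for _ in range(10):
            P = np.exp(-gamma * d2)
            r = P @ w - y                          # residuals
            # d/dgamma of exp(-gamma*d2) is -d2 * exp(-gamma*d2)
            grad = 2 * r @ ((-d2 * P) @ w) / len(y)
            gamma = max(gamma - lr * grad, 1e-6)   # keep the width positive
    return w, gamma
```

Per-center widths γ_k would replace the scalar `gamma` with a vector and the one-dimensional descent with a K-dimensional one, exactly as in the lecture.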
Now very quickly, I'm going to go through two aspects of RBFs. One of them relates them to kernel methods, of which we have already seen the beginning: we have used the RBF as a kernel, so we would like to compare the performance. Then I will relate RBFs to regularization. It's interesting that RBFs as I described them, with the intuition of local influence and all of that, will turn out in a moment to be completely based on regularization. And that's how they arose in the first place, in function approximation.
So let's do the RBF versus its kernel version. Last lecture we had a kernel, the RBF kernel, and we had a solution with 9 support vectors. Therefore, we ended up with a solution that implements this. Let's look at it. I take a sign, which is a built-in part of support vector machines; they are for classification. I had this expression after I expanded z transposed z in terms of the kernel, so I am summing over only the support vectors, and there are 9 of them. This coefficient becomes my parameter, the weight, and it happens to have the sign of the label. That makes sense, because if I want to see the influence of x_n, it might as well be that the influence of x_n agrees with the label of x_n; that's how I want it. If it's +1, I want the +1 to propagate. So because the alphas are non-negative by design, the weights get their sign from the label of the point. And now the centers are points from the data set; they happen to be the support vectors. And I have a bias there. So that's the solution we have.
What did we have here? We had the straight RBF implementation, with 9 centers. I am putting the sign in blue, because it is not an integral part; I could have done a regression version. But since I'm comparing, I'm going to take the sign and consider this a classification. I also added a bias, also in blue, because that is not an integral part either; I'm adding it in order to be exactly comparable. So the number of terms here is 9, the number of terms there is 9, and I'm adding a bias to both. Now the parameter here is called w, which takes the place of that coefficient, and the centers here are general centers, the μ_k's. These do not have to be points from the data set, and indeed they most likely are not. So these are the two models. How do they perform? That's the bottom line. Can you imagine? This is exactly the same model in front of me. In one of them, what did I do? Unsupervised learning of the centers, followed by the pseudo-inverse; I used linear regression for classification. That's one route. What did I do here? Maximize the margin, equate with a kernel, and pass to quadratic programming. Completely different routes, and finally I have a function that is comparable. So let's see how they perform.
Just to be fair to the poor straight RBF implementation, the data doesn't cluster normally, and I chose 9 centers only because I got 9 support vectors. So the SVM has the home advantage here. This is just a comparison; I didn't optimize the number of centers, I didn't do anything. So if the SVM ends up performing better, OK, it's better; SVM is good. But it really has a little bit of an unfair advantage in this comparison. Let's look at what we have. This is the data. Let me magnify it, so that you can see the surface. Now let's start with the regular RBF. Both of them are RBFs, but this is the regular one. This is the surface you get after you do everything I said: Lloyd's algorithm, the pseudo-inverse, and whatnot. The first thing you realize is that the in-sample error is not zero; there are points that are misclassified. Not a surprise: I had only K centers, and I was minimizing mean squared error. It is possible that some points close to the boundary will go one way or the other. I'm interpreting the signal as being closer to +1 or -1, and sometimes it crosses. So this is what I get. And here is what I got last time from the SVM. Rather interesting. First, it's better: I have the benefit of looking at the faint green line, which is the target, and this one is definitely closer to the green curve, in spite of the fact that I never used the target explicitly in the computation. I used only the data, the same data for both. But this one tracks it better, it achieves zero in-sample error, and it's fairly close to the target. So here are two solutions coming from two different worlds, using the same kernel. And I think by the time you have done a number of problems using these two approaches, you will have it cold: you will know exactly what is going on. You will know the ramifications of doing unsupervised learning, what you miss by choosing the centers without knowing the labels, versus the advantage of support vectors.
The final item I promised was RBF versus regularization. It turns out that you can derive RBFs entirely based on regularization. You are not talking about the influence of a point; you are not talking about anything of that sort. Here is the formulation from function approximation that resulted in RBFs, and it is why people consider RBFs to be very principled, to have a merit. It is modulo assumptions, as always, and we will see what the assumptions are.

Let's say that you have a one-dimensional function, and you have a bunch of points, the data points. What you are doing now is trying to interpolate and extrapolate between these points in order to get the whole function. That is what you do in function approximation, and what you do in machine learning if your target function happens to be one-dimensional. What do you do in this case? There are usually two terms. With one of them, you try to minimize the in-sample error; the other is regularization, to make sure that your function is not crazy outside the data points. That's what we do.

So look at the in-sample error. That's what you do for the in-sample error, notwithstanding the 1/N, which I took out to simplify the form: you take the value of your hypothesis, compare it with the target value y, square it, and this is your error in sample. Now we are going to add a smoothness constraint, and in this approach the smoothness constraint is almost always taken as a constraint on the derivatives. If I have a function, and I tell you that the second derivative is very large, what does that mean? It means the function wiggles around; that's not smooth. And if I go to the third derivative, it is the rate of change of that, and so on. So I can go for derivatives in general. If you can tell me that the derivatives are not very large in general, that corresponds, in my mind, to smoothness.

The way they formulated smoothness is by taking, generically, the k-th derivative of your hypothesis, the hypothesis now being a function of x. I can differentiate it, and I can differentiate it k times, assuming it is parametrized in a way that is analytic. Now I square it, because I'm only interested in its magnitude, and I integrate from minus infinity to plus infinity. This will be an estimate of the size of the k-th derivative, notwithstanding that it's squared. If this is big, that's bad for smoothness; if it's small, that's good for smoothness. Now I'm going to up the ante and combine the contributions of different derivatives: I combine all the derivatives with coefficients. If you want only some of them, all you need to do is set the coefficients to zero for the ones you are not using. Typically, you will be using, let's say, the first derivative and the second derivative, and the rest are zero. And you get a condition like that.

Now you multiply it by λ, the regularization parameter, and you try to minimize the augmented error. The bigger λ is, the more insistent you are on smoothness versus fitting. We have seen all of that before. The interesting thing is that, if you actually solve this under certain conditions and assumptions, and after the incredibly hairy mathematics that goes with it, you end up with radial basis functions. What does that mean? It means: I am looking for an interpolation, and I am looking for as smooth an interpolation as possible, in the sense of the sum of the squares of the derivatives with these coefficients, and the smoothest interpolation happens to be Gaussian. That's all we are saying. So it comes out, and that's what gives RBFs a bigger credibility as being inherently self-regularized, and whatnot. What you get is the smoothest interpolation, and that is one interpretation of radial basis functions.
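Written out, the augmented error just described combines the squared fit term with the weighted derivative penalties. The following is a sketch of the functional from the verbal description, where the coefficients a_k are the ones set to zero for the derivatives you are not using:

```latex
E_{\text{aug}}(h) \;=\; \sum_{n=1}^{N}\bigl(h(x_n)-y_n\bigr)^2
\;+\; \lambda \sum_{k} a_k \int_{-\infty}^{\infty}
\left(\frac{d^k h}{dx^k}\right)^{\!2} dx
```

Minimizing this over h, under the assumptions mentioned, yields the radial basis function form.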
On that happy note, we will stop, and I'll take questions after a short break.

Let's start the Q&A.
MODERATOR: First, can you explain again how an SVM simulates a two-layer neural network?

PROFESSOR: OK. Look at the RBF in order to get a hint. What does this feature do? It actually computes the kernel, right? So think of what this unit is doing as implementing the kernel. What is it implementing? It's implementing θ, the sigmoidal function, the tanh in this case, of this quantity. Now take this as your kernel, and verify that it is a valid kernel. In the case of radial basis functions, we had no problem with that. In the case of neural networks, believe it or not, depending on your choice of parameters, that kernel could be a valid kernel corresponding to a legitimate Z space, or it could be an illegitimate kernel. But basically, you use that as your kernel, and if it is valid, you carry out the support vector machinery. What are you going to get? You get the value of the kernel evaluated at different data points, which happen to be the support vectors. These become your units, and then you combine them using the weights; that is the second layer of the neural network. So it implements a two-layer neural network this way.
MODERATOR: In a real example, where you're not comparing to support vectors, how do you choose the number of centers?

PROFESSOR: This is perhaps the biggest question in clustering, and there is no conclusive answer. There are lots of information criteria, and this and that, but it really is an open question; that's probably the best answer I can give. In many cases, there is a relatively clear criterion. I'm looking at the minimization, and if I increase the number of clusters by one, supposedly the sum of the squared distances should go down, because I have one more parameter to play with. So if I increase the number by one and the objective function goes down significantly, then it looks like adding that center was warranted; if it doesn't, then maybe it wasn't a good idea. There are tons of heuristics like that, but it is really a difficult question. The good news is that if you don't get it exactly right, it's not the end of the world. If you get a reasonable number of clusters, the rest of the machinery works, and you get a fairly comparable performance. Very seldom is there an absolute hit in terms of the number of clusters needed, if the goal is to plug them into the rest of the RBF machinery later on.

MODERATOR: So cross-validation would be useful for--

PROFESSOR: Validation would be one way of doing it. There are many things to validate with respect to, but this is definitely one of them.
MODERATOR: Also, is RBF practical in applications where there's a high dimensionality of the input space? I mean, does Lloyd's algorithm suffer from high-dimensionality problems?

PROFESSOR: Yeah, it's a question of-- distances become funny, or sparsity becomes funny, in higher-dimensional spaces. So the choice of γ and other parameters becomes more critical. And if it's really a very high-dimensional space and you have few points, then it becomes very difficult to expect good interpolation. So there are difficulties. But the difficulties are inherent. The curse of dimensionality is inherent in this case, and I think it's not particular to RBF's. You use other methods, and you also suffer from one problem or another.
MODERATOR: Can you review again how to choose gamma?

PROFESSOR: OK. This is one way of doing it. Here I am trying to take advantage of the fact that determining a subset of the parameters is easy. If I didn't have that, I would have treated all the parameters on an equal footing, and I would have just used a general nonlinear optimization, like gradient descent, to find all of them at once, iterating until I converge to a local minimum with respect to all of them. Now that I realize that when γ is fixed there is a very simple way to get to the w's in one step, I would like to take advantage of that. The way I'm going to do it is to separate the variables into two groups, the expectation and the maximization, according to the EM algorithm. When I fix one of them-- when I fix γ-- then I can solve for the w_k's directly. I get them. That's one step. Then I fix the w's that I have, and try to optimize with respect to γ, according to the mean squared error. So I take this quantity with the w's being constants and γ being a variable, apply it to every point in the training set, x_1 up to x_N, subtract y_n, square, and sum them up. That is the objective function. Then I take its gradient and try to minimize, until I get to a local minimum-- a local minimum with respect to this γ, with the w_k's treated as constants; there is no question of varying the w_k's in that step. So I get a value of γ at which I assume a minimum. Now I freeze it, and repeat the iteration. Going back and forth will be far more efficient than doing gradient descent on everything, just because the one step that involves so many variables is a one-shot. And usually, the EM algorithm converges very quickly to a very good result. It's a very successful algorithm in practice.
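A minimal sketch of this alternation, under several assumptions: the data, the two fixed centers, and a grid search over γ (standing in for the gradient-descent step the professor describes) are all illustrative choices, not the lecture's implementation.

```python
import math

# Toy 1-D data set and two fixed centers (hypothetical values).
X = [0.0, 0.3, 0.5, 0.7, 1.0]
Y = [math.sin(2 * math.pi * x) for x in X]
centers = [0.25, 0.75]

def phi(x, mu, gamma):
    """Gaussian basis function exp(-gamma * (x - mu)^2)."""
    return math.exp(-gamma * (x - mu) ** 2)

def solve_w(gamma):
    """Step 1: with gamma fixed, the w's minimize the squared error,
    so they solve the 2x2 normal equations (Phi^T Phi) w = Phi^T y."""
    a = b = c = d = e = 0.0
    for x, y in zip(X, Y):
        p0, p1 = phi(x, centers[0], gamma), phi(x, centers[1], gamma)
        a += p0 * p0; b += p0 * p1; c += p1 * p1
        d += p0 * y;  e += p1 * y
    det = a * c - b * b
    return ((c * d - b * e) / det, (a * e - b * d) / det)

def mse(w, gamma):
    """Mean squared training error of the RBF model."""
    err = 0.0
    for x, y in zip(X, Y):
        h = w[0] * phi(x, centers[0], gamma) + w[1] * phi(x, centers[1], gamma)
        err += (h - y) ** 2
    return err / len(X)

# Step 2: with the w's frozen, pick the gamma with the lowest error
# (a crude grid search instead of a gradient step), then alternate.
gamma = 1.0
for _ in range(5):
    w = solve_w(gamma)                     # one-shot solve for the w's
    gamma = min([0.5, 1.0, 2.0, 4.0, 8.0], key=lambda g: mse(w, g))
w = solve_w(gamma)
```

Each half-step can only lower the training error, which is why the alternation converges quickly: the step involving many variables (the w's) is an exact linear least-squares solve, a one-shot, just as described.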
MODERATOR: Going back to neural networks, now that you mentioned the relation with SVM's-- in practical problems, is it necessary to have more than one hidden layer, or is it--

PROFESSOR: Well, in terms of approximation, there is an approximation result that tells you you can approximate everything using a two-layer neural network. And the argument is fairly similar to the argument that we gave before. So it's not necessary. If you look at people who are using neural networks, I would say the minority use more than two layers. So I wouldn't consider the restriction to two layers, dictated by support vector machines, as being a very prohibitive restriction in this case. But there are cases where you need more than two layers, and in that case you go for the straightforward neural networks, and then you have an algorithm that goes with that. There is an in-house question.
STUDENT: Hi, professor. I have a question about slide one. Why do we come up with this radial basis function? You said it's because the hypothesis is affected by the data point which is closest to x.

PROFESSOR: This is the slide you are referring to, right?

STUDENT: Yeah, this is the slide. So is it because you assume that the target function should be smooth? Is that why we can use this?

PROFESSOR: It turns out, in hindsight, that this is the underlying assumption, because when we looked at solving the approximation problem with smoothness, we ended up with those radial basis functions.
There is another motivation, which I didn't refer to, and this is a good opportunity to raise it. Let's say that I have a data set, x_1 y_1, x_2 y_2, up to x_N y_N. And I'm going to assume that there is noise. But it's a funny noise: it's not noise in the value y, it's noise in the value x. That is, I can't measure the input exactly, and I want to take that into consideration in my learning. The interesting ramification is that if I assume there is noise, and let's say the noise is Gaussian, which is a typical assumption-- although this is the x that was given to me, the real x could be here, or here, or here. And since I have the value y at that x-- the value y itself I'm going to consider noiseless in this case; I just don't know which x it corresponds to. Then you will find that when you solve this, you realize that you have to make the value of your hypothesis not change much when you change x, because you run the risk of missing it. And if you solve it, you end up with an interpolation which is Gaussian in this case. So you can arrive at the same thing under different assumptions. There are many ways of looking at this. But smoothness definitely comes in one way or the other, whether by just observing it here, by the regularization, by the input-noise interpretation, or by other interpretations.
STUDENT: OK, I see. Another question is about slide six, when we choose small γ or large γ. Yes, here. So actually, just from this example, can we say that small γ is definitely better than large γ here?

PROFESSOR: Well, small is relative. The question is-- this is related to the distance between points in the space, because the value of the Gaussian decays in that space. This one looks great if the two points are here. But the same one looks terrible if the two points are here, because by the time you get here it will have died out. So it's all relative. But relatively speaking, it's a good idea to have the width of the Gaussian comparable to the distances between the points, so that there is genuine interpolation. And the objective criterion for choosing γ reflects that, because when we solve for γ, we are using the K centers. So you have points that are the centers of the Gaussians, but you need to worry about those Gaussians covering the data points that are nearby. Therefore, you are going to adjust the widths up or down so that the influence reaches those points. So the good news is that there is an objective criterion for choosing it. This slide was only meant to make the point that γ matters; the principled way of solving for it was the other part.
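One informal way to act on the "width comparable to the spacing" remark is to initialize γ from the spacing of the centers before running the principled iterative solution. This is a sketch under stated assumptions: the centers are hypothetical, and identifying the width scale of exp(-γ r²) as 1/√(2γ) is the usual Gaussian convention, not something fixed by the lecture.

```python
# Hypothetical centers from a clustering step (1-D for simplicity).
centers = [0.1, 0.4, 0.9]

def nearest_center_gap(cs):
    """Average distance from each center to its nearest neighbor."""
    gaps = [min(abs(c - o) for j, o in enumerate(cs) if j != i)
            for i, c in enumerate(cs)]
    return sum(gaps) / len(gaps)

# exp(-gamma * r^2) has width scale sigma = 1/sqrt(2*gamma); matching
# sigma to the typical spacing d gives gamma = 1 / (2 * d^2).
d = nearest_center_gap(centers)
gamma0 = 1.0 / (2.0 * d * d)
```

Such a γ lets each Gaussian reach its neighboring points without drowning them, the "genuine interpolation" regime; the iterative procedure can then refine it.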
STUDENT: So does that mean that choosing γ makes sense when we have fewer clusters than samples? Because in this case, we have three clusters and three samples.

PROFESSOR: This slide was not meant to be a utility for γ. It was meant just to visually illustrate that γ matters. The main utility, indeed, is for the K centers.

STUDENT: OK, I see. Here, actually, in both cases the in-sample error is zero-- same generalization behavior.

PROFESSOR: You're absolutely correct.

STUDENT: So can we say that K, the number of clusters, is a measure of the VC dimension, in this sense?
PROFESSOR: Well, it's a cause and effect. When I decide on the number of clusters, I decide on the number of parameters, and that will affect the VC dimension. So this is the way it goes, rather than the other way around. I didn't want people to take the question as: oh, we want to determine the number of clusters, so let's look at the VC dimension. That would be the argument backwards. So your statement is correct; they are related. But the cause and effect is that your choice of the number of clusters affects the complexity of your hypothesis set.
STUDENT: Not the reverse? Because I thought, for example, if you have N data points, and we know what kind of VC dimension will give good generalization, then based on that, can we kind of--

PROFESSOR: So this is out of necessity. You're not saying that this is the inherent number of clusters needed to do the job; it's what you can afford.

STUDENT: Yeah, that's what I mean.

PROFESSOR: Then in that case, it's true. But it's not directly the number of clusters you can afford-- it is, indirectly, the number of parameters you can afford, because of the VC dimension. And because I have that many parameters, I have to settle for that number of clusters, whether or not they break up the data points correctly. The only thing I'm trying to avoid is having people think that this carries an answer to the optimal choice of clusters from an unsupervised-learning point of view. That link is not there.
STUDENT: I see. But in this example, it seems there's no natural cluster in the input sample; it's uniformly distributed in the input space.

PROFESSOR: Correct. And in many cases, even if there is clustering, you don't know the inherent number of clusters. But again, the saving grace here is that we can do a half-cooked clustering, just to have representatives of some points, and then let the supervised stage of learning take care of getting the values right. So it is just a way to think of clustering: instead of using all the points, I'm trying to use K centers, and I want them to be as representative as possible. That will put me ahead of the game. And then the real test comes when I plug them into the supervised stage.

STUDENT: OK. Thank you, professor.
MODERATOR: Are there cases where RBF's are actually better than SVM's?

PROFESSOR: There are cases. You can run them in a number of cases, and if the data is clustered in a particular way, and the clusters happen to have a common value, then you would expect that doing the unsupervised learning will get me ahead, whereas the SVM's now are based on the boundary points, and they have to be such that the cancellations of the RBF's give me the right value. So you can definitely create cases where one will win over the other. Most people will use the RBF kernels-- the SVM approach.
MODERATOR: Then that's it for today.

PROFESSOR: Very good. Thank you. We'll see you next week.