
The Kernel Trick

By DataMListic

Summary

## Key takeaways

- **SVMs need maximum margin, not just any line.** Support Vector Classification aims to find a decision boundary that maximizes the margin, which is the distance to the closest points (support vectors) from each class. These support vectors are the only data points crucial for defining the boundary. [00:58], [01:13]
- **Transforming data can make non-linear problems linear.** When data is not linearly separable in its original dimension, it can be transformed into a higher dimension where a linear decision boundary can be found. This is illustrated by mapping 1D data to 2D or 2D data to 3D to achieve separability. [02:03], [03:38]
- **High-dimensional transformations are computationally expensive.** Transforming data to higher dimensions to achieve linear separability can lead to an exponential increase in the number of features, making computations extremely expensive in terms of time and memory. [04:05], [04:44]
- **Kernel trick bypasses explicit high-dimensional computation.** The kernel trick allows us to compute dot products between transformed data points in a high-dimensional space without actually performing the transformation, by using a kernel function that directly calculates this dot product. [04:49], [05:53]
- **Kernels map data to different implicit feature spaces.** Different kernel functions, such as polynomial or Radial Basis Function (RBF), correspond to distinct implicit feature spaces. The RBF kernel, for instance, maps data to an infinite-dimensional space, all without explicit computation. [06:42], [07:18]

Topics Covered

  • Higher dimensions reveal linear separability in complex data.
  • Kernel trick computes dot products without explicit transformation.
  • Polynomial kernels simplify high-dimensional dot product calculations.
  • Kernel trick enables infinite dimensions without computational cost.

Full Transcript

Hey there. In this video, we'll talk about the kernel trick, probably one of the most fascinating yet confusing concepts in machine learning and support vector machines. When people first hear about the kernel trick, it seems like genuine mathematical sorcery, and part of the confusion comes from the word kernel itself. It could refer to a nonparametric way to estimate probability densities in statistics, the null space of a linear transformation in algebra, the core of an operating system, or even something to do with nuts and seeds. But don't worry: by the end of this video, you'll understand what makes the kernel trick so special in machine learning.

Let's start with a recap of support vector classification. Imagine we have some data points in a 2D space, with blue points for class one and red points for class zero. In support vector classification, we are trying to find the decision boundary, a line in this case, that separates these two classes, but not just any line. We want the line that creates the maximum margin between the classes. This margin is the distance from the closest points of each class to the decision boundary. These closest points are called support vectors, and they are the only data points that actually matter for defining our boundary. If we were to move any other point without crossing the margin, our decision boundary wouldn't change at all.

Mathematically, our decision boundary is a hyperplane defined by w · x + b = 0, where w is the normal vector to the hyperplane, x is the input vector, and b is the bias term. We classify a new point based on which side of the hyperplane it falls on: if w · x + b > 0, we predict class one, and if w · x + b < 0, we predict class zero.
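
As a small illustration of this decision rule (not from the video), here is a minimal NumPy sketch; the weight vector and bias below are made-up values rather than learned ones:

```python
import numpy as np

# Hypothetical, hand-picked hyperplane parameters (not learned from data)
w = np.array([1.0, -2.0])  # normal vector to the hyperplane
b = 0.5                    # bias term

def predict(x):
    """Classify a point by which side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if np.dot(w, x) + b > 0 else 0

print(predict(np.array([3.0, 0.0])))  # w.x + b = 3.5 > 0  -> class 1
print(predict(np.array([0.0, 1.0])))  # w.x + b = -1.5 < 0 -> class 0
```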

But what happens when our data isn't linearly separable? Let's take a simple example in one dimension. Here we have blue points at positions -4, -3, 3, and 4, and red points at -2, -1, 1, and 2. As you can see, there is no single point that can separate the blue and red classes: no matter where we put our decision boundary, we'll always misclassify some points. This is where transformations come in. What if, instead of looking at just the x position, we transform our data to include x² as well? Now our 1D data becomes 2D data: each data point x becomes (x, x²). And look what happens. Our blue points become (-4, 16), (-3, 9), (3, 9), and (4, 16), while our red points become (-2, 4), (-1, 1), (1, 1), and (2, 4). As you can see, now we can draw a horizontal line, at y = 6 for instance, that perfectly separates our classes.
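
A minimal sketch of this 1D-to-2D mapping (not from the video), using NumPy and the point positions above; it checks that the line y = 6 separates the transformed classes:

```python
import numpy as np

blue = np.array([-4.0, -3.0, 3.0, 4.0])  # class 1
red = np.array([-2.0, -1.0, 1.0, 2.0])   # class 0

# Map each 1D point x to the 2D point (x, x^2)
blue_2d = np.column_stack([blue, blue ** 2])
red_2d = np.column_stack([red, red ** 2])

# In the transformed space, the horizontal line y = 6 separates the classes
print((blue_2d[:, 1] > 6).all())  # True: every blue point lies above the line
print((red_2d[:, 1] < 6).all())   # True: every red point lies below it
```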

Let's look at an even more complex example in 2D. Here we have data arranged in concentric circles: the inner circle is one class and the outer ring is another. There is clearly no straight line that can separate these classes. But if we transform our data by adding a third dimension z, where z = x² + y², something magical happens. The outer ring gets lifted higher than the inner circle, and now we can separate the classes with a horizontal plane. This is the power of transforming our data to a higher dimension where it becomes linearly separable.
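
A short sketch of this lifting step (not from the video), assuming scikit-learn's make_circles as a stand-in for the concentric-circles data; a linear SVM fails on the raw 2D points but separates the lifted 3D points:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Two concentric rings: inner circle = one class, outer ring = the other
X, y = make_circles(n_samples=200, factor=0.3, noise=0.0, random_state=0)

# No straight line separates the rings in 2D
print(LinearSVC(C=10.0).fit(X, y).score(X, y))  # roughly chance level

# Lift the data: add z = x^2 + y^2 as a third feature
z = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, z])

# In 3D, a horizontal plane (a linear boundary) separates the classes
print(LinearSVC(C=10.0).fit(X_lifted, y).score(X_lifted, y))  # 1.0
```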

But here is where we hit a problem: what if we need many dimensions to make our data linearly separable? For instance, suppose we have data with just two features and we want to use all polynomial terms up to degree 3. We would end up with features like 1, x₁, x₂, x₁², x₁x₂, x₂², x₁³, x₁²x₂, x₁x₂², and x₂³. That's already 10 dimensions, and it gets dramatically worse as the original dimension or polynomial degree increases. Computing in these high-dimensional spaces can be extremely expensive in terms of both time and memory.
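
A quick way to see the blow-up (not from the video): the number of monomials of degree at most d in n variables is C(n + d, d), which scikit-learn's PolynomialFeatures confirms for the two-feature, degree-3 case:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 2 original features, all polynomial terms up to degree 3 -> 10 features
X = np.array([[1.0, 2.0]])
print(PolynomialFeatures(degree=3).fit_transform(X).shape)  # (1, 10)

# Number of monomials of degree <= d in n variables is C(n + d, d)
for n, d in [(2, 3), (10, 3), (100, 3), (100, 10)]:
    print(f"{n} features, degree {d} -> {comb(n + d, d)} transformed features")
```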

And this is where the kernel trick comes in to save the day. In short, the key insight is this: when training a support vector machine, we don't actually need the transformed data points themselves. We only need the dot products between pairs of transformed data points. This appears in the SVM optimization problem as: maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ φ(xᵢ) · φ(xⱼ), where the αᵢ are our Lagrange multipliers, the yᵢ are our class labels, and φ is our transformation function. If you got confused by this long equation, don't worry about it. The single term that matters here is the dot product between φ(xᵢ) and φ(xⱼ).
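
To make this concrete (an illustration, not the video's code), scikit-learn's SVC accepts a precomputed Gram matrix, so the classifier only ever sees pairwise dot products of transformed points, never the transformed points themselves; the feature map below reuses the z = x² + y² lifting from earlier:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, random_state=0)

# Explicit transformation phi(x) = (x1, x2, x1^2 + x2^2), shown only for illustration
def phi(X):
    return np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

# The SVM never needs phi(X) itself, only the Gram matrix of pairwise dot products
K = phi(X) @ phi(X).T

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))  # trained and evaluated from dot products alone
```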

Now, here is the trick: for many useful transformations, we can compute this dot product directly, without ever computing the transformation. We define a kernel function k(x, y) that equals the dot product of φ(x) with φ(y). Then we can rewrite our optimization problem using this kernel function instead. Let me show you an example with a polynomial kernel of degree 2. If we fully expand our transformed feature vector and compute the dot product, we get a big messy expression, but after simplifying, we realize it can be written as simply (1 + xᵀy)². This means that instead of transforming our data to six dimensions and then computing the dot products, we can just use our original data and apply this much simpler kernel function.
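
A quick numerical check of this identity (not from the video), using the standard six-dimensional degree-2 feature map for 2D inputs; the test vectors are arbitrary:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2D input (6 dimensions)."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel evaluated directly in the original 2D space."""
    return (1.0 + x @ y) ** 2

x = np.array([0.7, -1.3])
y = np.array([2.0, 0.5])

print(phi(x) @ phi(y))    # dot product in the 6D transformed space
print(poly_kernel(x, y))  # same number, without ever building phi(x) or phi(y)
print(np.isclose(phi(x) @ phi(y), poly_kernel(x, y)))  # True
```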

Common kernel functions include the linear kernel, which is just xᵀy; the polynomial kernel, (xᵀy + c)^d; the radial basis function (RBF) kernel, exp(−γ‖x − y‖²), that is, e raised to the power of negative gamma times the squared distance between x and y; and the sigmoid kernel, tanh(γ xᵀy + c). Each kernel corresponds to a different implicit feature space. The polynomial kernel maps to all polynomial terms up to degree d, while the RBF kernel maps to an infinite-dimensional space. But thanks to the kernel trick, we never need to actually compute in these spaces.
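
These kernels are available off the shelf in scikit-learn's SVC; as a rough sketch (not from the video), here they are on the concentric-circles data from earlier, where the RBF kernel separates the classes without us ever touching its infinite-dimensional feature space:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, random_state=0)

# Each kernel works in a different implicit feature space
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(f"{kernel:>8}: training accuracy = {clf.score(X, y):.2f}")
# The linear kernel cannot separate the rings; the RBF kernel separates them
# while only ever evaluating k(x, y) = exp(-gamma * ||x - y||^2).
```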

Now, let's summarize what we have learned. When data isn't linearly separable, we can transform it into a higher dimension where it becomes linearly separable. This transformation can create extremely high-dimensional spaces that would be computationally expensive to work with directly. The kernel trick lets us compute dot products in these high-dimensional spaces without ever transforming the data, making support vector machines practical for complex nonlinear classification tasks. And this is why the kernel trick feels like mathematical magic: we are effectively working in a high-dimensional space without paying the computational cost. It's a beautiful example of how mathematical insight can lead to a powerful practical algorithm.

And that's it for this video. I hope you now understand the kernel trick better: what it is, why it works, and how it makes support vector machines so powerful. I also hope that this knowledge will help you not just implement SVMs correctly, but also understand why they work the way they do. Please hit the like button if you found this helpful, share your thoughts in the comments below, and subscribe for more content like this. Also, I would like to give a big thanks to everyone supporting this channel, including my Patreon supporters and YouTube members. Consider joining them if you'd like to help me create more content. Your support makes these videos possible. Thanks for watching, and I will see you next time.
