
【Machine Learning 機器學習】3-Hour Beginner Tutorial | Artificial Intelligence (AI) | Python | Introduction to Machine Learning #AI #ML #DeepLearning

By GrandmaCan -我阿嬤都會

Summary

Key takeaways

- **AI Enables Human-Like Machine Intelligence**: Artificial intelligence is the goal of making machines as intelligent as humans so they can handle complicated tasks. Machine learning achieves this by finding rules in past data, while deep learning uses neural networks that imitate the human brain to extract those rules. [00:40], [01:19]
- **Machines Learn Like Children, from Examples**: Machines learn when data such as pictures of cats and dogs is fed into a model, which initially guesses randomly and is corrected on its errors, much like teaching a child through repeated examples until it answers accurately. After training, the model can classify new, unseen images correctly. [04:12], [07:49]
- **Predict House Prices from Multiple Features**: Floor area (pings), location, and number of rooms serve as features for training a model that predicts the selling price, the label, correcting its guesses against real data until the error is small. For a 72-ping house in Washington with 6 rooms, the model estimates 7.3 million. [07:52], [11:15]
- **Linear Regression Fits Salary to Seniority**: Simple linear regression represents salary data as y = w*x + b, where x is seniority and y is salary, finding the w and b that minimize the squared errors. For 2.5 years of seniority, it predicts a salary of about 51K. [16:05], [17:53]
- **Cost Function Measures Model Fit**: The cost function is the average squared difference between predicted and actual values, forming a parabola whose minimum indicates the optimal parameters. Gradient descent finds this minimum efficiently by updating the parameters along the slope. [34:06], [39:44]
- **Gradient Descent Optimizes with a Learning Rate**: Gradient descent updates parameters by subtracting the slope times a learning rate. A moderate rate converges quickly; too high a rate oscillates past the minimum, and too low a rate makes progress indefinitely slow. After enough iterations it finds parameters yielding a low cost, such as 32.69. [59:19], [01:07:12]

Topics Covered

  • Machine Learning: Finding Rules in Data
  • Teaching a Child vs. Training a Machine: The Same Learning Process
  • Simple Linear Regression: Predicting Salary from Experience
  • Testing Gradient Descent: Comparing Predictions to Real Data
  • Feature Scaling Solves Gradient Descent Oscillation

Full Transcript

Hello everyone, I am Xiaobai.

This is a machine learning course that combines theory with practice. We will use Python for the practical part, so if you haven't learned Python yet, you can take my Python course first. As for the theory, this is not a mathematics class, so there won't be any difficult math involved; don't worry. Okay, without further ado, let's begin.

Hello everyone, I'm Xiaobai. Welcome to this course. The first topic is artificial intelligence, AI for short. You have probably heard the term often. So what is artificial intelligence? In a simple sentence: we want machines to have the same intelligence as humans. Once a machine has intelligence, it can help us handle many complicated tasks.

Then what is machine learning, ML? How do we give a machine intelligence? One of the ways is to let the machine learn. Now the question is how to let a machine learn. We can first think about how humans learn: humans learn from past history and past experience. The same is true for machines. A machine's past history and experience are the data it has stored. So machine learning, in a simple sentence, is finding rules in the data stored from the past.

Then what is deep learning, DL? Machine learning is finding rules in data, and there are many ways to find those rules; one of them is deep learning. Deep learning finds the rules by imitating the neural networks of the human brain. So what is a neural network? I will introduce that later in the course.

Okay, let's summarize artificial intelligence, machine learning, and deep learning. We can use this picture. Artificial intelligence is the ultimate goal we want to achieve: we want machines to be as intelligent as humans so they can help us handle many complicated tasks. How do we achieve artificial intelligence? One of the ways is to let machines learn. Machine learning, in a simple sentence, is finding rules in data. There are many ways to find rules in data, and one of the most powerful is deep learning, which imitates brain-like neural networks to find the rules. So this is the relationship between artificial intelligence, machine learning, and deep learning. Next, let's understand how the machine learns.

As mentioned before, machine learning means finding rules in data. Now let's go a step further and discuss how a machine, or computer, actually finds those rules. In fact, we can use some mathematical techniques and programs to make the machine find the rules in the data. Let's look at what the machine learning process looks like.

Suppose we have some picture data in hand, pictures of dogs and cats, and we want the machine to learn from these pictures how to distinguish dogs from cats. That is to say, we want the machine to find the rules for telling dogs and cats apart from this data. What can we do? We can first think about how humans learn to distinguish dogs from cats.

Suppose you want to teach a child, and this child does not know what a cat or a dog is; you have to teach him how to tell the difference. At the beginning, you might show him a picture and ask whether it is a cat or a dog. You can see from the child's blank eyes that he has no idea what you are talking about; he just wants to go to bed, so he answers casually, "It's a dog." Obviously he got it wrong, so we, as teachers, mark it with a cross and tell him that an animal that looks like this is a cat. Then we show another picture and ask again whether it is a cat or a dog. This time the child's eyes show a little more confidence than last time, and he answers, "It's a cat." Unfortunately, he got it wrong again, so we mark it with a cross again and tell him that an animal that looks like this is a dog. We keep showing photos like this and letting the child answer, and whenever he gets one wrong, we ask him to correct himself. Over time, after seeing many pictures of cats and dogs, he will know roughly what a cat looks like and what a dog looks like. Then when you ask him whether a picture is a cat or a dog, he will tell you firmly and confidently, "This is a dog."

The same learning method can be applied to the machine. We can input a picture into the machine, that is, into the computer, and ask it whether the picture is a cat or a dog. After the picture enters the computer, some programs are triggered, and behind these programs some mathematical techniques are used. This combination of mathematical techniques and programs can also be called a model. So here we can say that we input a picture into the model and ask it whether the picture is a cat or a dog. At the beginning, the model is like an ignorant child: it does not know what a cat or a dog is, so it guesses randomly, "It's a dog." Obviously it got it wrong, so we mark it with a cross, tell it that something that looks like this is a cat, and ask the model to correct itself. Then we input another picture into the model and ask again; this time it answers "cat." Wrong again, so we mark it with a cross, tell it that something that looks like this is a dog, and ask the model to correct itself. Through continuous training like this, constantly showing the machine many pictures of cats and dogs and asking it to correct every wrong answer, the model eventually reaches a certain accuracy: input a picture of a dog, and the model tells you it is a dog. After the machine, or the model, has been trained, suppose you have a new picture in the future and want to know whether it shows a cat or a dog; you can input it into the model, and it will tell you.

Next, let's look at an example. Suppose we have some data on house sales: the floor area of each house, in pings (a Taiwanese unit of floor area, about 3.3 square meters), and its corresponding sale price. We want the machine to find the rule between the floor area and the sale price from this data; in other words, we want to guess what the sale price should be based on the floor area. Here, the floor area is called the feature and the price is called the label. We guess the price from the floor area, so the floor area is the feature and the price is the label.

The training process is similar to before. We input the floor area, the feature, into the model. The model knows nothing at the beginning, so it guesses randomly; suppose it guesses 4 million. Looking at the data we have in hand, the price of this 50-ping house is 5 million, meaning our label is 5 million, which is obviously far from the guess. So we tell the model that the sale price is 5 million and ask it to correct itself. Next, we enter the next data point, 66 pings; the model guesses 9 million, while the data in hand says 6.5 million, so the label is 6.5 million, and the model corrects itself again. We keep revising and keep going through the data until the model's predictions are consistent with the real data, that is, until the error against the labels is small; then we say the model training is complete. After training, suppose you have a 72-ping house you want to sell and don't know how much it should go for; you can input the feature, 72 pings, into the model and ask it to predict the price.

You may have doubts at this point: the price of a house shouldn't be determined only by its floor area. We may also need to consider its location, its layout, and other factors to evaluate what a house is worth. That's right, so you can collect more complete data. Besides the floor area, we have also collected each house's location and number of rooms, so now the machine needs to estimate the sale price from the floor area, location, and number of rooms. Now the floor area, location, and number of rooms are called the features, and the price is still the label: we estimate the price from these three, so they are the features and the price is the label.

The learning process is the same as before. We input the features, the floor area, location, and number of rooms, into the model. The model is ignorant at first, so it guesses randomly; suppose it guesses 4 million at the beginning. The data in hand, the label, says 5 million, so we ask the model to correct itself. Then we enter the second data point and again ask the model to guess; it guesses 9 million, the label is 6.5 million, and we ask it to correct itself. We just keep training until the model's error falls within a certain range, and then we say the training is complete. In the future, if you have a 72-ping house in Washington with 6 rooms that you want to sell, and you want to guess how much it can sell for, you can input these features into the model; it estimates the house can sell for 7.3 million. This is roughly the process of machine learning.
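The features-to-label mapping just described can be sketched as a toy linear model. Everything numeric below, the weights, the bias, and the idea of encoding location as a score, is invented for illustration; the video does not specify the model's internals.

```python
# Toy sketch: predict a sale price (in millions) from three features.
# The weights w_* and bias b are made up for illustration; a real model
# would learn them from the training data described above.

def predict_price(pings, rooms, location_score,
                  w_pings=0.08, w_rooms=0.2, w_loc=0.3, b=0.1):
    """Linear combination of features -> predicted label (the price)."""
    return w_pings * pings + w_rooms * rooms + w_loc * location_score + b

# A 72-ping house with 6 rooms and a (hypothetical) location score of 1.
print(predict_price(72, 6, 1))
```

In practice the weights are not chosen by hand; they come out of exactly the guess-and-correct training loop the transcript describes.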

Next, we will go straight to implementation. In this class we are going to use Colab. Colab is a Python environment provided by Google for free; it lets us write Python programs quickly and easily in the browser. Because it is provided by Google, you first need to create a Google account and log in. After logging in, you can see nine dots here; click them and find Google Drive. In the upper left corner there is a New button; click it, and if you can find Google Colaboratory in the menu, click it. If you can't find it, click "Connect more apps", find Colab there, click in, and click Install. It may ask you to log in; log in, and after the installation is complete, click OK to finish. Now when we click New again, you can see Google Colaboratory has been added. Click it, and you can see it has created an environment for writing Python for us.

First, look at the upper left corner: that is the name of the file, and you can modify it. Suppose I change it to "environment setup"; that finishes the rename. Because I'm more used to a black background, I go to Tools, find Settings, and under Site there is a Theme option; we select Dark and click Save, and the background becomes black. Of course you can also set your favorite color, and on the Editor side you can set the text size, spacing, and so on; I'll leave that for you to explore yourself.

Before we write any program, we need to connect. Go to the upper right corner and click Connect, and Colab will allocate some resources for us to use. After the connection is complete, look at the cell in the middle: that is where we write programs. We can enter code directly in this cell. Suppose I enter print(87); I'll zoom in a bit so you can see it more clearly. After typing print(87), there is an execute button here; click it, and it will start to run, with the result displayed below.

In Colab, code can be divided into cells. If you hover the mouse slightly above a cell, you can see "+ Code" and "+ Text" pop up; the same happens slightly below it. Suppose I click "+ Code": an extra cell appears, and we can write Python programs in it. Suppose I write print(88) and press execute; you can see the result displayed below. Now if I click "+ Text", a cell for writing text appears. Suppose I write "hello, everyone"; besides plain text, you can also choose headings, fonts, and so on. You can see an extra text cell here. We can add many code cells and many text cells, and if you don't want a cell, you can delete it with the trash-can icon here: delete, delete, delete, done.

Colab has another important feature: it provides a free GPU or TPU for us to use. GPUs and TPUs can speed up our computation a lot. Go to Edit at the top, then Notebook settings; you can see a Hardware accelerator option where you can choose GPU or TPU. In the following classes we will use a GPU for acceleration, so suppose I choose GPU here and press Save; you can see that Colab reallocates the connection, and we have to wait for it. After it reconnects, it may ask whether we want to delete the previous runtime; suppose I press Cancel for now. After the GPU is connected, if we want to see which GPU we were given, we can type !nvidia-smi in a code cell and execute it; you can see that the GPU we are using now is a Tesla T4. With that, our environment setup is complete.

The first mathematical technique I want to introduce is called simple linear regression. You can see the word "simple" right in its name, so you can imagine that it is very simple; don't worry too much.

Let's first describe the situation we face today. Suppose you are the boss of a new start-up company and you want to hire your first employee, but you are not sure how much salary you should pay him, so you go to the market and collect some data about people in the same position: their years of experience and the corresponding salary. Let's draw this data in a graph, with the x-axis as seniority and the y-axis as monthly salary; each cross here represents one piece of data you collected. From this picture we can easily see that seniority is directly proportional to monthly salary: the higher the seniority, the higher the monthly salary.

Now here is the problem. You are the boss of a new start-up company and want to hire your first employee. Your first employee arrives and tells you his seniority is two and a half years, that is, 2.5. How much salary should you offer him? This is where simple linear regression comes in. Since seniority is roughly proportional to monthly salary, we can probably use a straight line to represent this data. Simple linear regression means representing the data with the most suitable straight line. Suppose we have found this most suitable line; then we can plug the seniority of 2.5 into the line to see how much salary we should give this employee. After plugging it in, we can see we can probably offer this employee a salary of about 51K.

Now the problem becomes: how do we find the most suitable straight line? Let's first look at how a straight line is represented mathematically. We can write y = w*x + b to represent a straight line. Applying this formula to the present example, y is the monthly salary and x is the seniority. So the problem now becomes finding the most suitable w and the most suitable b to represent this data, that is, to represent this straight line. Let's implement it directly and see what kind of straight line is produced by different values of w and b.
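As a quick sketch of the y = w*x + b idea: the values w = 20 and b = 1 below are illustrative assumptions, not parameters given in the video; they just happen to reproduce the 51K figure at 2.5 years of seniority.

```python
def predict_salary(x, w=20.0, b=1.0):
    """Simple linear model: monthly salary (in K) = w * seniority + b.
    w and b here are illustrative guesses, not fitted values."""
    return w * x + b

print(predict_salary(2.5))  # 51.0 with these assumed parameters
```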

First, let's read in the data. We can use the pandas module to do the reading; here I import it as pd. Pandas is a very useful data-processing tool, and we don't need to install it in Colab because it is installed by default, so we can import it directly. Next, the URL of the data we want to use can be found in the course files; I use a variable to hold it. Pay special attention: the file we want to read is a CSV file, so we can use pandas' read_csv to do the reading; we only need to pass the URL as the argument, and I assign the result to a variable as well. We can display the result directly and execute. This is the data we are going to use, seniority and the corresponding salary, indexed 0 to 32, that is, 33 rows.

We want to use a straight line to represent this data. We have said that a straight line can be written mathematically as y = w*x + b. In our example, we want to use seniority to predict salary, so seniority is x and salary is y. We first separate x from y. We want to get the YearsExperience column; to pull out a column, you can put its name in brackets after the DataFrame, and it will separate out that column. The same goes for y: I want the Salary column. Then I display x and y to check that there is no problem.
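A minimal sketch of the reading-and-splitting step. Since the real CSV URL lives in the course files, an inline stand-in table is read instead, and the column names YearsExperience and Salary are assumptions about the dataset's headers.

```python
import io

import pandas as pd

# Stand-in for pd.read_csv(url): the real URL is in the course files,
# so we read the same shape of data from an inline CSV string instead.
csv_text = """YearsExperience,Salary
1.1,39
2.0,43
3.2,54
"""
data = pd.read_csv(io.StringIO(csv_text))

# Split the feature (x, seniority) from the label (y, salary).
x = data["YearsExperience"]
y = data["Salary"]
print(x.tolist(), y.tolist())
```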

Now let's plot this data; I'll open another cell. To draw pictures, we can use a very useful package called matplotlib; we use the pyplot under it, which I import as plt. Like pandas, matplotlib is preinstalled in Colab, so we can import and use it directly. I can use scatter to draw the picture; we only need to pass in the x and y we just separated, and it draws the data for us. At the end I want to display it, so I write show and execute. You can see it displays our data point by point. If you want to change the style of the points, suppose I want to turn them into crosses and change their color; I can add some parameters. I can write marker="x" to change the marker into a cross, and I can also set the color to red by writing color="red". Execute it, and you can see the markers turn into red crosses. As for what other markers and colors are available, I'll leave that for you to research yourself if you're interested.

I can also add a title to the picture by setting its title; suppose my title is "seniority versus salary", in Chinese. Execute it, and you can see something is wrong with the title: it shows four empty boxes. The reason for this error is that matplotlib does not support Chinese by default. If we want to display Chinese, we can add a Chinese font ourselves.

Let's add it. First we need to download a Chinese font; I create a new cell to do the downloading. We can use the wget tool to do the download, but it is not installed in Colab by default, so we need to install it first. To install a package or module in Colab, you can enter pip install followed by what you want to install after an exclamation mark; after installation, import it. Then we can use the download function under it to do the downloading, writing the URL of the file we want as the argument. This URL can be found in the course files; it points to the font we want to download. So we can execute directly: it installs first, then imports, then downloads. After the download is complete, you can open the file panel on the left; this is the font we just downloaded, called Chinesefont.ttf. Let me close the panel.

After downloading the font, we can add it to matplotlib. To add fonts, we import matplotlib, which I call mpl, and from matplotlib's font_manager we import fontManager. After importing everything, you can use fontManager's addfont to add the font; the font we just downloaded is called Chinesefont.ttf, so we write Chinesefont.ttf here. After adding the font, we also need to set it as the font in use. To do that, we can use the rc under mpl to set something: what we want to set is the font, so the first argument is "font", and the second argument specifies the font to use, Chinesefont. That should do it. We add the font first, then set it as the font in use, and execute again; you can see the Chinese now displays normally.

Besides the chart title, we can also set the x-axis and y-axis labels of the chart. To set the x-axis label we use xlabel; our x-axis is seniority, so I write "seniority". Then the y-axis, with ylabel: the y-axis is the monthly salary, and its unit is thousands, so we write "monthly salary (thousands)". Execute again, and you can see the x-axis and y-axis labels now show seniority and monthly salary.
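The plotting steps above, gathered into one runnable sketch. The x and y values here are toy stand-ins for the course data, the labels are kept in English to sidestep the Chinese-font issue, and savefig replaces the plt.show() used in Colab.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Toy stand-ins for the seniority/salary columns from the course data.
x = [1.1, 2.0, 3.2, 4.5]
y = [39, 43, 54, 61]

# Red crosses for the data points, plus a title and axis labels.
plt.scatter(x, y, marker="x", color="red")
plt.title("Seniority vs. monthly salary")
plt.xlabel("Seniority (years)")
plt.ylabel("Monthly salary (thousands)")
plt.savefig("scatter.png")  # in Colab you would call plt.show() instead
```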

Next, let's draw the straight line; I'll add another cell. As mentioned just now, a straight line can be written mathematically as y = w*x + b. Pay special attention: we want to use the value of x to predict y, so this y is different from the y in our data; this one is the predicted value, while that one is the real data. So below, we can use x times w plus b to represent our predicted y; I call it y_pred. Now, suppose I first set w to 0 and b to 0, and take a look at what the line drawn by w = 0, b = 0 looks like. To draw a line we can use the plot under plt: the first argument is the x values, and the second is the y values, which this time are our predictions, y_pred. We also display it with show. Let's execute and see: this is the line drawn when w equals 0 and b equals 0. I can also change its color; suppose I set it to blue, which I think looks good. Then I draw the data from above, those crosses, into the same figure: I copy all of that code here and execute again. This is our prediction line, and these are the real data. Our goal is to find the line that best represents the data; right now this line is w = 0, b = 0, so we need to find the most suitable w and b.

Before that, we can add a legend to the chart. Our plot call draws the straight line and the scatter call draws the crosses. We can add a label to each: on the line's side the label is "prediction line", and on the crosses' side the label is "real data". After adding the labels, we write plt.legend here to display the legend. Execute it, and you can see the legend displayed: this straight line is the prediction line, and the crosses are the real data.

Let's wrap all of this into a function so we can pass in different w and b. I call it plot_pred; it takes w and b as parameters, and I indent the body. Let's try bringing in different w and b and see what the result looks like. Here we can call plot_pred; suppose I pass in 0 and 0 first and execute once, and the result is the same as before. Suppose I modify the value of b and change it to 10 and execute again; the line seems unchanged, but in fact it has changed. That's because the ranges of the x-axis and y-axis are not fixed, so the graph only looks unchanged. So let's fix the axis ranges first. To fix them, we can write plt.xlim and set its minimum and maximum. The x values run from about 0 to 10, with a little extra here, so I set it to 0 to 12; we can write a list with the minimum and maximum values inside. The y values are the same: we set the minimum and maximum, and since predictions can come out negative, I let the minimum be -60 and the maximum 140. Execute again, and you can see the axis ranges are now fixed: the y-axis runs from -60 to 140 and the x-axis from 0 to 12.

Let's try 0 and 0 again: with 0 and 0, the line sits where y equals 0. If I change the value of b to 10 and execute again, you can see the line go up, so we know the value of b controls whether the line moves up or down. Change it to 40 and it goes up further; change it to -30 and execute, and you can see it go down. Now let's try changing the value of w. It was originally 0; change it to 10 and see: the line has become slanted, sloping upward. Change it to 20 and it slopes more steeply. If I change it to a negative number, -10, you can see it slopes downward: a positive w slopes upward and a negative w slopes downward. Let me try -5 on this side, and you can see a line like this; we can't see part of it because it's beyond the border.
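A minimal numeric sketch of what the plot demonstrates: b shifts every prediction by the same amount, while w changes how fast predictions grow with x.

```python
def predict(x_values, w, b):
    """y_pred = w * x + b for each x."""
    return [w * x + b for x in x_values]

xs = [0, 2, 4, 6]

flat = predict(xs, w=0, b=0)     # [0, 0, 0, 0]: the line sits on y = 0
lifted = predict(xs, w=0, b=10)  # every prediction rises by exactly b
tilted = predict(xs, w=10, b=0)  # predictions now grow with x

print(flat, lifted, tilted)
```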

Next, let's make this picture dynamic, so that we don't have to keep adjusting the values of w and b manually. To add interactive components in Colab, we can use the ipywidgets tool; here I import interact from it. Then I can write the function name as the first argument, and what we want to adjust dynamically are the values of w and b, which are the two parameters the function takes. Here I can set the range of w: I want it between -100 and 100 with a step of 1, and I set b the same way, -100 to 100 with a step of 1. Execute it directly, and you can see interactive sliders appear here, with default values of 0 for both w and b, and I can adjust them. When I raise w to 8, the line looks like this; then raise b and it slowly moves up, and raise it further still. Everyone can play around, adjusting different values of w and b to see what the straight line looks like. Well, I'll leave it to you to play with.

After seeing what kind of straight line will be produced by different w and b

, then our question will become what

is the best way? What about the straight line suitable for these data? Let

’s define it

and give each line a score.

Let’s first look at a set of relatively simple data

. If I also want to use a straight line to represent these data

, suppose today I am Using this straight line

, this straight line is the result of w equal to 0 and b equal to 0

, then we should give this straight

line a score for how well it fits these data , or you can say, how well

this

straight line fits these data

Let’s give a score. Then how do we give this line a score? It

’s very simple.

We

just need to calculate the distance between these real data and this line

. Because if the data is closer to this line If they

match, the distance between these data and this line will be smaller

, so we only need to calculate the distance between each line

and these real data , and then find the smallest one among them,

and that line will be the most suitable for these.

The straight line of the data Let’s

do an actual operation directly

In this example there are a total of three data points, located at (1, 1), (2, 2), and (3, 3). We want to use this straight line, with w equal to 0 and b equal to 0, to represent the data, and we want to give this line a score for how well it suits the data. How do we score it? We can use the sum of the distances between the data points and the line as the basis for scoring. For this point the distance to the line is 1 - 0 = 1; for this point it is 2 - 0 = 2; and here, 3 - 0 = 3. So our formula is written like this. You may notice that besides 1 - 0, I have also squared it. The reason for squaring here is that it makes the calculation convenient: later on there may be negative numbers, and squaring solves the problem of negative numbers directly. So in this example, the square of 1 minus 0, plus the square of 2 minus 0, plus the square of 3 minus 0: the final sum of distances, or rather the sum of squared distances, comes out to 14.

Let's look at another line. Suppose we use this line, which is the result of w equal to 2 and b equal to 0, to represent the data. We calculate the distances in the same way. The point here is also at 1, so here it is 1 minus 2; then here the square of 2 minus 4; then the square of 3 minus 6. You can see there is a negative number here: if we didn't square, we would be left with a negative number. Doing the same calculation and summing, the result is likewise 14. Then look at the next line.

You can see that this line matches the data quite well. It is the result of w equal to 1 and b equal to 0. Calculating the distances, you can find that each prediction equals the real value, so the computed sum of squared distances is 0. Of the three lines this is the smallest, so we say the line with w equal to 1 and b equal to 0 is the straight line best suited to these data.
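As a quick sketch, the scoring of these three lines can be reproduced in a few lines of Python (the language used for the implementation later in the video); the points (1, 1), (2, 2), (3, 3) are the ones from this example:

```python
# The three example points from the transcript: (1, 1), (2, 2), (3, 3)
x = [1, 2, 3]
y = [1, 2, 3]

def sum_squared_distance(w, b):
    # sum of squared vertical distances between the points and the line y = w*x + b
    return sum((yi - (w * xi + b)) ** 2 for xi, yi in zip(x, y))

print(sum_squared_distance(0, 0))  # 1^2 + 2^2 + 3^2 = 14
print(sum_squared_distance(2, 0))  # (-1)^2 + (-2)^2 + (-3)^2 = 14
print(sum_squared_distance(1, 0))  # perfect fit: 0
```

Squaring each distance is what removes the negative signs, which is why the w = 2 line also scores 14 rather than -14.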

, and then the scores, or the sums of squared distances, generated by different w and b can be written as a function. This function is called the cost function. In our example the cost function can be written like this: it takes the real data, subtracts the predicted value, and squares the result. That is, we calculate the squared distance between the real data and the straight line, and that is the cost, which is the score we just talked about.

Let's go back to the original example. We want to find the straight line that best fits the data, so we need to give each line a score, that is, calculate its cost. Let's first assume that all the lines have b equal to 0, and see how much cost corresponds to different values of w. We plug them in one by one: first a w of about -10, whose cost lands here; then the second and the third; and plugging in point after point, after many, many points we find that the cost function is actually a parabola like this.

This is the case where b equals 0. If b is not equal to 0, what will the cost corresponding to w and b look like? Here one axis is the value of w, one is the value of b, and the z-axis is the value of the cost. You can see its graph looks like this, and the red point is the point with the lowest cost. Let's look from two other angles: from both of them you can see that the place with the red dot, as I just said, is the place with the lowest cost. We can see that around w equal to 10 and b equal to 20 is where the lowest cost is produced. That is our goal: to find the most suitable straight line is to find the most suitable w and b, the ones whose line has the smallest sum of squared distances to all the real data, which is exactly the place where the cost is smallest.

Now that we understand this, let's implement the cost function directly. Here I've opened a new Colab file to implement it. At the beginning we need to read the data; the method is the same as before, so I just copy, paste, and execute. After the data is read, we implement the cost function in a new cell. Our cost function takes the real data, subtracts the predicted value, and squares the result, so we can write it like this: first compute the predicted value, which I call y_pred; it equals w multiplied by x plus b. We don't know anything about w and b yet, so let's assume w equals 10 and b equals 0. After computing the predicted value, we take the real data y and subtract the prediction y_pred, then square it, so I wrap it in parentheses and square it. I call this cost. Displaying it directly, you can see a total of 33 values: each is a real value minus a predicted value, squared, which is our squared distance. If we want to sum up these squared distances, we just write .sum() after it and display it; that is the sum of the squared distances. When this value gets large, we usually average it instead. We have 33 records in total, so we could divide by 33 here, but it is better to write it as the length of x: the length of x is 33, so dividing by it gives the average of the squared distances. Next, let's write this cost calculation

as a function, so that we can plug in different values of w and b. I open another cell; I call the function compute_cost, and it needs to receive our data x and y plus the predicted line's w and b values. The calculation is the same as above, so I paste it directly: first the predicted value, w multiplied by x plus b, then calculate the cost, then add it up and take the average. I set the final cost equal to that averaged result and return it. Let's try what results different w and b produce: I call it here with x and y, the real data, and the w of 10 and b of 0 we just used. It says compute_cost is not defined — oh, the cell hasn't been executed yet, so it needs to be executed first; running it again, the value returned is 602-point-something, the same as before. Changing the input and executing again, you can see the cost this time is 227-point-something. You can then plug in values yourself to see how much cost different w and b generate.
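A minimal sketch of the compute_cost cell. The x and y arrays here are made-up stand-ins, since the notebook's 33 salary records aren't shown in this section, so the 602 and 227 values won't be reproduced:

```python
import numpy as np

def compute_cost(x, y, w, b):
    # average squared distance between the real data and the line y = w*x + b
    y_pred = w * x + b
    return ((y - y_pred) ** 2).mean()

# stand-in data; the real notebook loads 33 salary records instead
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

print(compute_cost(x, y, 10, 0))  # a large cost for a badly fitting line
print(compute_cost(x, y, 1, 0))   # 0 for a perfect fit
```

Using `.mean()` rather than `.sum() / len(x)` gives the same average in one step.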

Then let's try it with b fixed at 0. In this case, let w range between -100 and 100, and see what its cost will be. First I define costs and let it be a list, to store the cost values for w between -100 and 100. Then we can use a for loop to do the calculation: I let w use range starting from -100, and here I have to write 101, which generates a sequence running from -100 up to 100. Inside the loop we use the cost-calculating function directly, with x and y, b fixed at 0, and each value of w brought in. I call the result cost and append it to our costs list. Finally I display it to see, and execute. Counting them, wow, there are a total of 201 values, from w equal to -100 all the way to 100. I'll collapse the cost output first
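The w loop just described, sketched with the same kind of stand-in data:

```python
import numpy as np

def compute_cost(x, y, w, b):
    return ((y - (w * x + b)) ** 2).mean()

x = np.array([1.0, 2.0, 3.0])  # stand-in data
y = np.array([1.0, 2.0, 3.0])

costs = []
for w in range(-100, 101):                  # 101 so that w runs -100 .. 100
    costs.append(compute_cost(x, y, w, 0))  # b fixed at 0

print(len(costs))  # 201 values
```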

and then let's draw the cost corresponding to different w as a picture. Here we import our drawing tool matplotlib, specifically the pyplot under it, which I call plt. Then we can use scatter to draw each data point: we draw each w and its corresponding cost, with our w between -100 and 100. Displaying and executing, you can see that between -100 and 100 there are 201 points in total, and drawn like this the 201 points are densely packed. Besides drawing points, we can also connect them directly into a line, or rather a parabola: here we use plot, passing in w and the corresponding cost, and executing connects the parabola up. Then I'll add a title and labels for it: the title is the cost function when b equals 0 and w is between -100 and 100; then set its xlabel, which is the value of w, and then the ylabel, cost. Executing, you can see the title and the labels of the x-axis and y-axis are all in place.
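The plotting cell, roughly as described; the Agg backend and savefig are additions so this sketch also runs outside a notebook:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Colab
import matplotlib.pyplot as plt

def compute_cost(x, y, w, b):
    return ((y - (w * x + b)) ** 2).mean()

x = np.array([1.0, 2.0, 3.0])  # stand-in data
y = np.array([1.0, 2.0, 3.0])

ws = list(range(-100, 101))
costs = [compute_cost(x, y, w, 0) for w in ws]

plt.scatter(ws, costs)   # the 201 individual points
plt.plot(ws, costs)      # connected into the parabola
plt.title("cost function b=0 w=-100~100")
plt.xlabel("w")
plt.ylabel("cost")
plt.savefig("cost_1d.png")
```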

Then we also take the value of b into account, letting b likewise range from -100 to 100, and see how much the cost will be. Here I introduce numpy, a very useful tool for matrix operations, which I call np. Now both w and b run between -100 and 100. We can use arange under np; the usage of this arange is basically the same as range above. I want to create an array between -100 and 100 with an interval of 1, and we can write it like this. I call it ws, because there are many w values, and b is the same, so I call those bs. Then I create a two-dimensional matrix: here I can use np.zeros to create a matrix filled with 0s, with a first dimension of 201 and a second of 201, because there are 201 values of w and 201 values of b. This matrix is for storing the costs corresponding to different w and different b, so I call it costs.

Then we can do the calculation. I use a for loop to run through all the values in ws, and inside it another for loop to run through all the values in bs, and then we compute their cost. We already wrote the cost function above, so I just copy it down and use it directly: x and y are the real data, and passing in w and b gives the cost, which we can store in the costs matrix. Here I define an i that starts at 0, and inside define a j that starts at 0; the entry at i and j is set to this cost, then j is incremented by 1, and at the end of the inner loop i is incremented by 1. After the calculation, the cost corresponding to every combination of w and b is stored in costs. Displaying it and executing — it says numpy has no such attribute; I typed an extra r here — executing again, it needs a little time. The result looks like this: a two-dimensional matrix in which each value is the cost computed from one w and one b.
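The nested loop can also be written with enumerate, which replaces the manual i and j counters; this is a sketch with stand-in data, not the notebook's exact cell:

```python
import numpy as np

def compute_cost(x, y, w, b):
    return ((y - (w * x + b)) ** 2).mean()

x = np.array([1.0, 2.0, 3.0])  # stand-in data
y = np.array([1.0, 2.0, 3.0])

ws = np.arange(-100, 101)      # 201 w values
bs = np.arange(-100, 101)      # 201 b values
costs = np.zeros((201, 201))   # one cell per (w, b) combination

for i, w in enumerate(ws):
    for j, b in enumerate(bs):
        costs[i, j] = compute_cost(x, y, w, b)

print(costs.shape)  # (201, 201)
```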

Next, let's plot the cost taking w and b into account at the same time. Because we need to consider the values of w and b together, we draw a 3D graph. To create one, we can write plt.axes and set projection='3d'; I call the result ax. Displaying it first and executing, you can see a 3D plot is created, but there is nothing in it yet. You can also see that the panes of this plot are a bit gray. I don't want them gray, so I can change that too: I can use ax.xaxis.set_pane_color to set the color to white, writing the RGB value for white, (1, 1, 1), and execute. It says there is no such attribute — I have two letters reversed here — and after fixing it you can see the pane on the x-axis side has become white. Then I do the same for the y-axis and z-axis sides, copying the line directly and just changing it to y and z. Executing again, everything is white, which feels much more comfortable.

Next, let's draw the cost corresponding to w and b as a surface plot. Here we can use plot_surface, passing in the w and b values we just created, the ws and bs arrays, and finally the costs we stored. But pay special attention here: we don't actually pass in the two one-dimensional arrays ws and bs themselves; what we want is the two-dimensional grid generated from those two one-dimensional arrays. To generate this two-dimensional grid we can use meshgrid under numpy, and note in particular the order: we pass in bs first and then ws. It builds the two-dimensional grids and sends them back to us; I call them b_grid and w_grid. If you want to understand more about what this two-dimensional grid is and how it works, there is a page with a very detailed explanation; I won't go into it here, and I'll put the URL below for you to study on your own. So here we change the call to pass in w_grid and b_grid, and then there's no problem. Executing, you can see the surface plot displayed.

Let's add a title and labels to this picture. Here we can write ax.set_title; pay special attention that we need the extra set_ in front, which is different from before, where we just wrote title directly. I set the title to the cost generated by w and b, that is, their corresponding cost. Since the title uses Chinese, we need to add a Chinese font. I go back to the previous file, where the Chinese font was added in this spot: it first has to be downloaded, so I paste the download step here and execute it, and then paste the step that registers the font. First download, then add the font. After the title is set, we set the labels of the x, y, and z axes, again with set_: set_xlabel, with the x-axis label w; then, copying it directly, the y-axis label b; and then the z-axis, whose label is cost. Displaying again, you can see this is w, this is b, and their corresponding cost is on the z-axis.

If we want to rotate this picture, that's possible too. We can call view_init, which takes two parameters: the first is the up-and-down rotation angle, and the second is the left-and-right rotation angle. Suppose I write 45 for the first and -120 for the second; executing again, you can see the picture is now turned like this: this side is w, this side is b, and this is the cost. You can play with the rotation angles as well. Then, I think the color of this surface isn't very good-looking and I want to change it. You can set another parameter here called cmap; I set it to Spectral_r. Executing, you can see the color is much better now. As for what other colormaps can be set here, I'll leave that for you to research; I think this one is good. We can also set an opacity value, called alpha; letting it equal 0.7 and executing, you can see the surface now has a more transparent feel, which looks more comfortable.

Then, if we want to make this picture look even better, I can add a wireframe to it. I can write plot_wireframe and pass in the same first three parameters, and then I set the wireframe color to black. Executing, you can see the wireframe has been added, but it is a bit too dark, so we can also set its transparency alpha; assuming I make it 0.1 and execute, you can see it is very comfortable and beautiful.
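Putting the 3D steps together (axes, white panes, meshgrid, surface, colormap, alpha, wireframe, labels, view angle) gives roughly this cell; the costs matrix here is a made-up stand-in with its minimum near w=10, b=20, and the headless backend is an addition for running outside a notebook:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Colab
import matplotlib.pyplot as plt

ws = np.arange(-100, 101)
bs = np.arange(-100, 101)
# stand-in costs matrix; the notebook fills it with the nested loop above
costs = (ws[:, None] - 10) ** 2 + (bs[None, :] - 20) ** 2

ax = plt.axes(projection="3d")
for axis in (ax.xaxis, ax.yaxis, ax.zaxis):
    axis.set_pane_color((1, 1, 1))        # white panes instead of gray

b_grid, w_grid = np.meshgrid(bs, ws)      # note the order: bs first, then ws
ax.plot_surface(w_grid, b_grid, costs, cmap="Spectral_r", alpha=0.7)
ax.plot_wireframe(w_grid, b_grid, costs, color="black", alpha=0.1)

ax.set_title("cost for different w and b")
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("cost")
ax.view_init(45, -120)                    # elevation, then azimuth
plt.savefig("cost_3d.png")
```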

Next, let's find the point with the lowest cost. To do that, we can use min under numpy over our entire costs matrix. I print it out to see what the lowest cost is: the lowest cost is about 32.69. If we want to know which w and b correspond to that cost of about 32.69, we can do this: use np.where to find its location, that is, the index of the position of the lowest cost among all the costs. Because costs is a two-dimensional matrix, it returns two values, two indexes, which I call w_index and b_index. Printing these two values and executing, you can see the indexes it finds are 109 and 129. We then need to look up, from the whole ws array, the w value corresponding to that index, and the same for b. Executing again, you can see the corresponding values: the w value is 9 and b is 29. That is to say, when w equals 9 and b equals 29, there is the smallest cost. We can write it out like this: when w equals this value and b equals this value, there is the minimum cost, and the minimum cost itself can be read out of the two-dimensional costs matrix at those two indexes. Executing again, it says: when w equals 9 and b equals 29, there is the minimum cost, and the minimum cost is about 32.69.
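The minimum lookup, sketched on stand-in data (so the notebook's indexes 109 and 129 and cost 32.69 won't be reproduced):

```python
import numpy as np

def compute_cost(x, y, w, b):
    return ((y - (w * x + b)) ** 2).mean()

x = np.array([1.0, 2.0, 3.0])  # stand-in data
y = np.array([1.0, 2.0, 3.0])

ws = np.arange(-100, 101)
bs = np.arange(-100, 101)
costs = np.zeros((201, 201))
for i, w in enumerate(ws):
    for j, b in enumerate(bs):
        costs[i, j] = compute_cost(x, y, w, b)

min_cost = np.min(costs)
w_index, b_index = np.where(costs == min_cost)   # one index array per axis
print(ws[w_index[0]], bs[b_index[0]], min_cost)  # best w, best b, lowest cost
```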

Finally, we can also draw the point of minimum cost. To draw it we use scatter: the first value passed in is w, then b, and then the z-axis value, which is the cost. Executing, you can see the point with the smallest cost is here. I can change its color by setting color equal to red, and make it bigger by setting s, which is the size, to 40, then execute again; it looks much better. Lastly, I want to make the whole figure bigger. I can go up to the top and write plt.figure, setting its figsize with two values that represent the width and height of the figure. Suppose I start with 5 and 5; looking at it, it's still a little small, so I'll make it bigger, 7 and 7, which looks more comfortable. The other settings here, the rotation angle, the figure size, the colors and so on, you can all adjust yourselves. That completes the implementation of our cost function. With the cost function understood, our question now becomes

how to find, efficiently, the best w and b, the ones corresponding to the point with the lowest cost. It's not hard to see from our last implementation that we used brute force: I exhaustively enumerated every value of w from -100 to 100, looked at each cost, and found the lowest point. The same goes for w and b considered together: I enumerated every combination of w and b from -100 to 100 and found the lowest corresponding cost. But that is not a good way; we want to find the best w and b efficiently. For that we can use a method called gradient descent. Don't overthink it: gradient descent simply changes the parameters according to the slope. In our example the parameters are w and b, so it changes the values of w and b according to the slope. Let's look directly at how gradient descent works.

Let's take this as an example, and assume b equals 0 so that we only consider w. How do we find an optimal w that minimizes the cost? First, we need to set an initial value of w, which can be chosen randomly; let's say I set it here, roughly at -75. Then we can calculate the slope of the tangent line at this point through differentiation. Pay special attention here: when we use gradient descent, we don't actually know the blue line, the blue parabola. That parabola was obtained through our brute-force enumeration, so you don't know that the lowest point is here. Let's re-describe the problem like this: today you are blindfolded and dropped somewhere where you can only go forward or backward, and your goal is to reach the lowest point. Fortunately, you have some method to measure the steepness in front of and behind your current position, and you can use that steepness to find the way down. Back to the original example: at this point, you can calculate the slope of the tangent line through differentiation. The tangent slope corresponds to the degree of steepness, and we can use this slope to find the way down. Let me briefly explain

how the slope is calculated. First we need to write down what the cost function looks like. In our example the cost function looks like this: we take the real data, subtract the predicted value, and square it; in mathematical terms, y minus y_pred, squared. This y_pred can be expressed as w times x plus b, because we are representing the data as a straight line, so here it is w multiplied by x plus b; and in this example b equals 0, so we can simply omit it. Then we just differentiate with respect to w to get the slope of the tangent line. I won't show the detailed differentiation process here; if you are interested you can work through it yourself. In fact, it doesn't matter if you don't know what differentiation is at all, because there are many tools that can do the calculation for us automatically. After differentiating, if we want to know the tangent slope when w equals -75, we plug in -75; the x and y here are our real data, and plugging those in as well, we can calculate what the tangent slope is.

After we know the slope, it is equivalent to knowing the steepness here, and once we know the steepness we can walk down. How do we walk down? We take w and subtract the slope multiplied by a learning rate. What this learning rate is, I'll explain later; for now just look at w minus the slope. In this example w is currently about -75, and the tangent slope is clearly a negative number, so w minus a negative number means w gains a value; after gaining a value it moves forward. Then we keep repeating this action: we calculate the tangent slope at the new point by plugging in the numbers, and once we have the slope we put it into the same formula. Now w is about -60, and you can clearly see the slope is still negative, so subtracting a negative number again adds a value, and it moves forward again. We keep repeating: calculate the slope, update w; calculate the slope, update w; over and over, until we are almost at, or right at, the lowest point, where the slope of the tangent line is very close to 0. Once it is very close to 0, w subtracts a value close to 0, which is equivalent to w no longer updating, and at that point we have found the lowest point. That is roughly the operating process of gradient descent.
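The 1-D process just described can be sketched like this, with b fixed at 0 and stand-in data; the slope expression 2·x·(w·x − y) is the standard result of differentiating the squared-error cost with respect to w:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # stand-in data; best w for it is 1
y = np.array([1.0, 2.0, 3.0])

w = -75.0            # initial value, set roughly as in the example
learning_rate = 0.01

for _ in range(1000):
    # tangent slope: derivative of mean((y - w*x)^2) with respect to w
    slope = (2 * x * (w * x - y)).mean()
    w = w - learning_rate * slope   # step downhill; near the minimum the slope -> 0

print(w)  # approaches the best w (1 for this data)
```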

Next, let's take a look at what the learning rate is. Going back to the previous example: at the beginning we first pick an initial w, then we can calculate its tangent slope, and once we have the tangent slope, to walk down we take the value of w and subtract the slope multiplied by a learning rate. The learning rate is a value you need to set; you have to decide how large it is. Look at the slope multiplied by the learning rate: if your learning rate is larger, this product will be larger (a larger positive value or a larger negative one), so w subtracts a larger value and the change in w is greater; that is, its steps are bigger. Conversely, if your learning rate is smaller, the product is smaller, w subtracts a smaller value, the change in w is smaller, and the steps are smaller. Let's see what results different learning rates produce. First, with a high learning rate, the steps are relatively large. Seeing this you may have a question: are the steps gradually getting smaller? That's right, they are: even if our learning rate stays the same, the steps gradually shrink. Why? Because the slope here also changes: out here the slope is relatively large, and the closer we get to the lowest point, the smaller the slope, so the steps shrink as well. Now let's see what it looks like with a smaller learning rate: the steps are smaller. Comparing the two pictures together, the one on the left has a high learning rate and the one on the right a small one, and you can see the strides on the left are relatively large while those on the right are relatively small. At this point you may say: then we should definitely set the learning rate high, because with larger steps the descent is faster and we reach the lowest point sooner. That's a very good question, but the learning rate can also be too large. What does that look like? Watch: the first step brings you here; on the second step, instead of reaching the lowest point, you step right over it to the opposite side; then look at the third step: you step back over again. You keep stepping over and over like this and never reach the lowest point, because your stride is simply too big to ever land there. That is the problem of too large a learning rate. And what happens if the learning rate is too small? Every step you take is very, very small, so small that you can never walk to the lowest point no matter how long you go. So the learning rate can be neither too large nor too small; we have to find a moderate value, and we find it through continuous experiments and tests. Well, that is the learning rate. Then let's go back to the original gradient descent.
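The effect of the learning rate can be demonstrated on the same kind of stand-in data: a tiny rate barely moves, a moderate rate converges, and a too-large rate steps over the minimum and runs away:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # stand-in data; best w is 1
y = np.array([1.0, 2.0, 3.0])

def step(w, learning_rate):
    slope = (2 * x * (w * x - y)).mean()  # b fixed at 0
    return w - learning_rate * slope

for lr in (0.001, 0.1, 0.3):   # too small, moderate, too large
    w = -75.0
    for _ in range(10):
        w = step(w, lr)
    print(lr, w)   # 0.1 lands near 1; 0.001 barely moves; 0.3 overshoots away
```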

The examples just now considered only w. If we now have to consider b as well, how does gradient descent work? Basically the same way. First, you need to set random initial values of w and b. The picture you see here was also obtained by exhaustive enumeration, so you don't know that the lowest point is here. Let's apply the metaphor directly: today you are taken somewhere inexplicably and then blindfolded, in a place whose terrain is a bit like a canyon, and your goal is to find the lowest point of this canyon. But this time is different from last time: you can walk not only back and forth but also left and right, and you can also learn, by some method, the steepness in front of and behind you and the steepness to your left and right; then you can find the way down through that steepness. Steepness here means the same thing as before: the slope. So here we calculate the slope in the w direction, and here the slope in the b direction. To calculate the slopes, we can again differentiate the cost function. In this example the cost function is the real data minus the predicted value, squared: y minus y_pred, squared, where y_pred can again be decomposed into w multiplied by x plus b. Differentiating with respect to w gives the slope in the w direction; differentiating with respect to b gives the slope in the b direction. After the differentiation, the result looks like this: the slope in the w direction looks like this, and the slope in the b direction looks like this. If we want to know the slope at a particular point, we just plug the values in: the current w here, the current b here, and x and y are the data we have. Then we can calculate the slope in the w direction and the slope in the b direction, and with those we can update the values of w and b, which means we can walk down. To update w, we take w and subtract the slope in the w direction multiplied by the learning rate; to update b, we take b and subtract the slope in the b direction multiplied by the learning rate. In this way we take a step down. After reaching the new point we recalculate the slopes in the w and b directions, update again after the calculation, and in this way slowly walk toward the lowest point. Since we are walking blindfolded, how can we judge whether we are at the lowest point? In the same way as before: the closer you get to the lowest point, the smaller the slopes in the w and b directions become, and then w and b are each reduced by only a tiny value, which means they essentially stop changing; we can use this to judge where we are. As for the learning rate, you again have to set the value yourself, and it can be neither too big nor too small. If you set it too high, the steps will be so large that you may never be able to reach the lowest point; conversely, if it is too small, the steps will be so tiny that you may never get there no matter how far you walk. That is the operating process of our gradient descent. Now let's try it out directly.

. Okay , then we will implement the gradient descent.

First of all, we must read it first. The action of

getting data and reading data is the same as before

, so I directly use the copied one.

Then we just said that gradient descent

is to calculate the slope and then update the parameters.

To calculate the slope in the direction of w

, we can differentiate the cost function with respect

to w. To calculate the slope in the b direction

, we can differentiate the cost function with respect to b

. Let's do the calculation.

Let 's differentiate the cost function with respect to w

. The result is 2 times x and then multiplying

w times x plus b to subtract y

is the result of differentiating the cost function with respect to w

. If we differentiate the cost function with respect to b

, the difference is that one less x is multiplied here, so

the slope in the direction of w is called w gradient,

and the direction of b Well, I’ll call it b gradient

Now if I want to

know

what is the slope of the w direction and the b direction when w is equal to 10 and b is also equal to 10,

then I will first set w to 10 and b to 10

, let’s go first Looking at the good execution in the w direction,

we can see that it has generated a total of 33 values

​​because we have a total of 33 data. Every data

you bring in will generate a slope

. Here we will average it. If we

want to average,

we can First add it up , and

then divide it to see how many records it has

. Here, how many records are there? I will use n to represent it.

Let ’s calculate the length of x

, and then we can divide it by n.

Let’s execute it again.

This is the average result, which is -118 points.

Now the b direction. Again we have 33 data points, so it computes 33 values; we do the same thing, add them up and divide by n. Run it again, and the average is -27.46. Actually, if we want the average we can also just write .mean(), which computes it directly, with no need to take the length of x separately. You can see the result of running it is the same, and switching the w direction over gives the same answer too.

For convenience, I'll wrap this slope computation, or you could say this gradient computation, into a function. I'll call it compute_gradient. We pass in x and y, which are our data, together with the values of w and b, do the calculation, and return the computed slopes in the w direction and the b direction.

Then I'll open another cell to try it out. Suppose I want to know the slopes when w equals 20 and b equals 10. Run the function definition first, then run this cell: the w direction is about 537, and the b direction is a bit over 70.
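The gradient computation just described can be sketched like this; the toy x and y here are made-up stand-ins for the tutorial's 33 salary records, and the function name matches the one used in the video:

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Average slope of the mean-squared-error cost in the w and b directions.

    d(cost)/dw = mean(2 * x * (w*x + b - y))
    d(cost)/db = mean(2 * (w*x + b - y))   # same, just without the x factor
    """
    w_gradient = (2 * x * (w * x + b - y)).mean()
    b_gradient = (2 * (w * x + b - y)).mean()
    return w_gradient, b_gradient

# Toy data lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(compute_gradient(x, y, w=2, b=1))  # both slopes are 0 at the perfect fit
```

At the perfect fit the prediction error is zero for every point, so both averaged slopes come out to exactly 0, which is why gradient descent stops moving there.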

After computing the slopes in the w and b directions, we can update w and b. The update rule is to take w and b and subtract their slope multiplied by a learning rate. Let's first assume the initial w is 0 and the initial b is also 0; you can set these randomly, but I'll use 0. We compute the slope at w equals 0 and b equals 0, then update w and b according to that slope, using the function we just wrote: pass in w and b, and it sends back the slopes in the w and b directions. For w, we subtract the w-direction slope times a learning rate; after some testing and experimenting, I think 0.001 is a good learning rate. For b it's the same: subtract the b-direction slope times the learning rate. I set w equal to the updated result, and b equal to the updated result as well. Display them and run: after the update, w has gone from 0 to about 0.87, and b from 0 to about 0.14.

Now let's check: after w and b moved from (0, 0) to these new values, is the cost really decreasing, that is, are we really going downhill? We already wrote compute_cost before, so I'll just copy it over and use it directly. Print the cost before and after the update, and run it: it was 6,040 and is now 5,286, so the cost really is decreasing; we really are going down.
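A single update step as just described might look like this; again the toy data stands in for the real records, and the 0.001 learning rate is the one picked in the video:

```python
import numpy as np

def compute_cost(x, y, w, b):
    # Mean squared error between predictions and actual values
    return ((w * x + b - y) ** 2).mean()

def compute_gradient(x, y, w, b):
    w_gradient = (2 * x * (w * x + b - y)).mean()
    b_gradient = (2 * (w * x + b - y)).mean()
    return w_gradient, b_gradient

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

w = b = 0.0
learning_rate = 0.001
cost_before = compute_cost(x, y, w, b)
w_gradient, b_gradient = compute_gradient(x, y, w, b)
# Step downhill: subtract slope times learning rate
w = w - w_gradient * learning_rate
b = b - b_gradient * learning_rate
cost_after = compute_cost(x, y, w, b)
print(cost_before, cost_after)  # the cost drops after one update
```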

Let's pause for a moment and look at how the gradient computation was written. Find compute_gradient: whether it's the gradient in the w direction or the b direction, you should have noticed it multiplies by 2. That factor of 2 can actually be omitted. Why? Look down here: we take w and subtract the w-direction slope times a learning rate, and b minus the b-direction slope times a learning rate. Multiplying the slopes by 2 is exactly equivalent to multiplying the learning rate by 2 instead. The factor of 2 only indirectly affects the step size, and the step size is already controlled by the learning rate, so we don't need to write it: we can drop the 2 in the gradient and just double the learning rate. If we do that and calculate again, the result is the same, so the multiplication by 2 above is unnecessary; just omit it. I delete the ×2 and run again; this time the cost goes from 6,040 to 5,656.

That's the result of only one update; we've only updated w and b once. Next, let's see what it looks like after 10 updates. I'll use a for loop to repeat this 10 times, and record the values of w, b, and the cost along the way. I'll use an f-string: at the start I write which update this is, that is, iteration i, then the current cost, then the values of w and b. Run it and see: at the 0th update the cost is this value, along with w and b, and the cost really is decreasing, from 5,656 down to 3,161. I add a space between fields and run again, which reads much more comfortably, but there are a lot of decimal places, so it looks a bit untidy. If we only want two decimal places we can write .2f after the value, which displays only 2 digits after the decimal point; for 3 digits it would be .3f, and so on. I'll display only 2 digits for w and b as well and run again, which looks much neater: the cost is decreasing, while w and b keep updating all the way.

Let's try 20 iterations and run: the cost keeps falling the whole way, and by the 20th iteration only 1,705 is left. But now it looks untidy again, because from the 10th iteration on the index occupies two character positions. To make every number occupy the same number of positions, we can write a width after the colon: if I want 5 positions I write :5, and then no matter what the number is, it occupies 5 positions, which is much neater.

Besides the cost and the values of w and b, I'll also record the slopes in the w and b directions. These also have a lot of decimal places, so I'll show only 2 digits there too. Run again, and now everything is recorded. Looking at the cost column, it's still obviously declining, so I let it run a few more times to see how far it can drop. After running 100 times it still seems to be declining, but you can see the right side is becoming irregular again: our numbers still differ in magnitude, some with two or three digits, so the columns drift out of line. If we really want it neat, we can use scientific notation: change .2f to .2e, and it will show two digits and present the rest in scientific notation. What does that look like? Run it: it writes something like 5.66e+03, which means 5.66 times 10 to the third power, equivalent to 5,660; if you see 5.66e-03, that is 5.66 times 10 to the power of -3. Represented in scientific notation, it's much neater, and we can see the cost is still falling.
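The f-string format specs just discussed can be sketched in a few lines; the values here are made up for illustration:

```python
# :5 pads the number to 5 positions; .2f shows 2 digits after the decimal point
i, cost, w, b = 10, 5656.789, 0.87123, -0.14567

print(f"Iteration {i:5} : cost {cost:.2f}")  # fixed width, 2 decimals
print(f"cost {cost:.2e}")                    # scientific notation: 5.66e+03
print(f"w: {w: .2f}, b: {b: .2f}")           # leading space reserves a slot for a minus sign
```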

So let me run it a bit more: instead of 100 times I'll run it 1,000 times and then execute. Let's take a look: it still seems to be falling, 42, then still falling, 41. Let's push further and run it 10,000 times. Printing 10,000 lines is too much, so I'll have it print only once every 1,000 iterations: I check whether the iteration index is divisible by 1,000, and only then print the information. Run again, and now it prints once every 1,000 iterations.

We can see one side still looks a little messy; the reason is the extra minus sign. To solve this, we can add a space in front in the format spec: with a leading space, one extra position is given up to represent the sign. Reserve a sign position on each side and run again, and the problem is solved.

Looking at the cost, after 10,000 iterations it still seems to be declining, from 3.51 down to 3.39 in scientific notation. Let's run another 20,000 and display more decimal places: I change it to .4e so it shows 4 digits, and run again. It keeps declining, but the decline has become very small; over the 20,000 runs you can see the drop getting smaller and smaller. Look at the slopes next to it: the slope in the w direction and the slope in the b direction are also very, very small, almost close to 0.

Next I'll write this whole gradient descent process as a function, so it's convenient to use later. I'll call the function gradient_descent. There are many things to pass in: x and y, our data; the initial w and initial b; the learning rate; the cost function we use to judge the quality of the fit; the gradient function we use to compute the slopes; then the total number of iterations to run, run_iter; and how often to print out the data, which I call p_iter, with a default value of 1,000, meaning it prints once every 1,000 iterations. So inside the function we change the hard-coded 1,000 to p_iter and the 20,000 to run_iter; the initial w is w_init, so I set w equal to w_init, and b equal to b_init; the function that calculates the cost needs to change from compute_cost to the passed-in cost function, and likewise compute_gradient to the passed-in gradient function.

While I'm at it, I'll also store the cost and the values of w and b along the way. For the cost I call it c_hist and make it a list; for w it's w_hist, and for b it's b_hist: three lists used to store the cost, w, and b for every one of the iterations we run. After each update I store w, b, and the cost: w_hist.append to store w, the same for b, and the same for the cost. Finally I can do the return: I return the final w and b, plus all the w, b, and cost values from the whole process. So, let's implement it.
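A runnable sketch of the gradient_descent function just described, with tiny made-up data standing in for the tutorial's 33 salary records:

```python
import numpy as np

def compute_cost(x, y, w, b):
    return ((w * x + b - y) ** 2).mean()

def compute_gradient(x, y, w, b):
    return (2 * x * (w * x + b - y)).mean(), (2 * (w * x + b - y)).mean()

def gradient_descent(x, y, w_init, b_init, learning_rate,
                     cost_function, gradient_function, run_iter, p_iter=1000):
    """Repeatedly step w and b downhill, recording the whole history."""
    w, b = w_init, b_init
    w_hist, b_hist, c_hist = [], [], []
    for i in range(run_iter):
        w_gradient, b_gradient = gradient_function(x, y, w, b)
        w = w - w_gradient * learning_rate
        b = b - b_gradient * learning_rate
        cost = cost_function(x, y, w, b)
        w_hist.append(w)
        b_hist.append(b)
        c_hist.append(cost)
        if i % p_iter == 0:
            print(f"Iteration {i:5} : cost {cost:.4e}, w: {w: .2e}, b: {b: .2e}")
    # note: the return sits OUTSIDE the loop (the bug fixed in the video)
    return w, b, w_hist, b_hist, c_hist

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])  # exactly y = 2x + 1
w_final, b_final, w_hist, b_hist, c_hist = gradient_descent(
    x, y, 0, 0, 1.0e-2, compute_cost, compute_gradient, run_iter=5000)
```

On this toy data the parameters converge close to w = 2 and b = 1, the line the points were generated from.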

It says there are some problems, so let me create another cell. Before using the function we need to set the initial w and b; suppose my initial w and b are both 0, and the learning rate is 0.001. I can also write that in scientific notation: 1.0e-3, that is, 1.0 times 10 to the power of -3, which means 0.001. For the cost function we pass in compute_cost, the function we wrote to calculate the cost, and for the gradient function we pass in compute_gradient; here we can pass functions in as parameters. Then run_iter, and p_iter I can leave alone, no need to set it since it has a default.

Okay, let's execute and see. It returns those 5 values, so I'll write them out too: this w is the last w, which I call w_final; b is the last b, b_final; plus the stored w, b, and cost histories. Run it, and... why is it only printed once? Let's see why. Oh, I accidentally wrote the return inside the for loop, so it returned after the first iteration. It should be outside; move it out and run again, and now it's fine, printing every 1,000.

We can see the cost continues to decrease and the slopes are decreasing too. The final w comes out a little over 9.1, and b around 27.9. I'll print them out as well to see the final w and b values: about 9.14 and 27.88. Same as before, I don't want so many decimal places, so I format with .2f and it prints: the final w and b are 9.14 and 27.89.

Now we can use these final values to make predictions. You shouldn't have forgotten the problem we want to solve, but let me review it for you. Suppose you are the boss of a start-up company and you want to hire your first employee, but you don't know how much to pay him, so you go to the market and collect some relevant information about this position: seniority and the corresponding salary. You want to represent this data with a straight line, and once you have that line, you can use it to predict how much salary to give this employee. Now that we've found the line, we can make predictions.

Suppose the applicant tells you he has three and a half years of work experience, that is, a seniority of 3.5 years. We can calculate how much salary to give him: take the w_final we found, multiply it by 3.5, and add b_final, because we represented the data with a straight line, and a straight line is written as w times x plus b; here our x is 3.5, and the unit is k. The result says that for a seniority of 3.5 we can give him about 59.88k. Too many decimal places again, so I'll show just one decimal place and run again: we should give him about 59.9k.

There are a few more blanks to fill in. Suppose another applicant comes and tells you his seniority is 5.9 years; then we just make a prediction. I'll type a couple more words as a label, "predicted salary", and execute. Changing the seniority to 5.9 and running it, the predicted salary is about 81.8k. In this way, our problem is solved: we can predict salary like this.
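The prediction step amounts to plugging x into the fitted line; the w_final and b_final below are the rounded values the tutorial's run produced (salary in units of k):

```python
# Final parameters from the tutorial's gradient descent run
w_final, b_final = 9.14, 27.89

def predict(seniority):
    # A straight line: w * x + b
    return w_final * seniority + b_final

print(f"predicted salary: {predict(3.5):.1f}k")  # about 59.9k
print(f"predicted salary: {predict(5.9):.1f}k")  # about 81.8k
```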

Next, I'll draw some of this data as graphs, for example the cost. To draw, we can use the plotting library matplotlib together with numpy. Let's plot the decline of the cost over the 20,000 updates: we can use plt.plot to draw it as a line, with np.arange over the 20,000 updates, that is 0 through 20,000, as the x values, and our stored cost history as the y values. Display it, and you can see what it looks like. Then I'll add some labels and a title: the title here is the cost over a total of 20,000 updates, so I'll write iteration vs cost; the x label is the number of updates, iteration, and the y label is cost. Run again, and the title and labels are all in place.

From this picture we can find that the front part drops very quickly, but later it gets slower. If we want to look at the earlier part in detail, suppose I only want to see updates 0 through 100: I change the range to 0~100 here, and write [:100], that is, the first 100 records, and run again. This is the descent over its first 100 updates; if you want to see other intervals you can set them yourself. It seems I'm missing an "iteration" in the label, so fix that and execute.
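The cost plot described above can be sketched as follows; since this sketch has no real training run attached, c_hist is filled with a made-up decaying curve standing in for the stored history, and the figure is saved to a file instead of shown:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display window
import matplotlib.pyplot as plt

# Stand-in for the stored cost history of 20,000 updates
c_hist = 6000 * np.exp(-np.arange(20000) / 3000) + 30

plt.plot(np.arange(100), c_hist[:100])  # zoom in on the first 100 updates
plt.title("iteration vs cost")
plt.xlabel("iteration")
plt.ylabel("cost")
plt.savefig("cost.png")
```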

Finally, we can also draw the 20,000-step update process of w and b as a picture. Let's first go back to the earlier cost function section and copy the 3D-plot code over. Back then we set w and b to range between -100 and 100, so the calculation of the cost for w and b from -100 to 100 should also be copied here; I'll open another cell to do that calculation and let it run first. We also use Chinese fonts here, so we have to download the fonts again: the same way, I go back to the earlier cost function section, find where the fonts were downloaded, copy that, and open a cell to download them. Okay, let's wait for it to finish the calculation, installation, and drawing. Done. I'll adjust the viewing angle: it was originally 0 degrees and 0 degrees, and I make it 20 degrees and -65 degrees, then run again. The angle is good, and the red point is the lowest point.

Now we draw the update path of w and b with a line. To draw a line we can use plot, passing in the w_hist and b_hist we saved along with the cost history, that is, c_hist. Let me run it first: it draws the line, but I can't tell where the starting point is, so I'll draw the initial point as well. To draw a point I use scatter; here I just paste the copied call. The initial point is the 0th value of w_hist, the 0th value of b_hist, and the 0th value of the cost history; I make its color green and run again. You can see the green point is our initial position, and from there it updates all the way over to here, basically reaching the lowest point.

It updates all the way, but I think the color of the surface is a bit of an eyesore, so I changed the color, and I don't think the edge lines are necessary, so I canceled the borders; run again and it looks better. I also set the surface's opacity to 0.3 and ran again, which looks much more comfortable. So this is the curved surface: the green point is the initial point, it updates along the way, and it's basically about to reach the lowest point, because the red point is the lowest point.
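The surface-plus-path picture can be sketched roughly as below. The cost surface is computed over the same -100..100 grid as in the video, but the descent history here is a made-up straight path (w_hist, b_hist, c_hist would normally come from the training run), and toy data replaces the salary records:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display window
import matplotlib.pyplot as plt

def compute_cost(x, y, w, b):
    return ((w * x + b - y) ** 2).mean()

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

# Cost over the grid w, b in [-100, 100]
ws = np.linspace(-100, 100, 101)
bs = np.linspace(-100, 100, 101)
costs = np.array([[compute_cost(x, y, w, b) for b in bs] for w in ws])
b_grid, w_grid = np.meshgrid(bs, ws)  # shapes match costs[i, j] = cost(ws[i], bs[j])

ax = plt.axes(projection="3d")
ax.view_init(20, -65)                              # tilt the camera
ax.plot_surface(w_grid, b_grid, costs, alpha=0.3)  # translucent surface

# Stand-in descent history; real runs would pass the stored w_hist/b_hist/c_hist
w_hist = np.linspace(-100, 2, 50)
b_hist = np.linspace(-100, 1, 50)
c_hist = [compute_cost(x, y, w, b) for w, b in zip(w_hist, b_hist)]
ax.plot(w_hist, b_hist, c_hist)                               # the path
ax.scatter(w_hist[0], b_hist[0], c_hist[0], color="green")    # initial point
plt.savefig("surface.png")
```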

Well, I can also play around here: when doing gradient descent, I set different values. I copy a cell down below. If I set the initial point at -100, -100, let's try it and see what the result looks like: after running 20,000 times we again get the final w and b and the stored w, b, and cost for each step, and executing the drawing cell again, this time the initial point is over here, and it goes all the way down, down, down, then turns around like this, also getting close to the lowest point.

Then I can try not letting it update so many times: I only ask it to update 1,000 times and run. This time it walks down, only gets to here, and goes no further, because we didn't update enough times. You can also adjust other things. For example, I can increase the learning rate a little: I set it to 1.0 times 10 to the power of -2, which is 0.01, and try again. It still reaches the lowest point.

Then you may find that the plotted initial point is not at the -100, -100 we set. According to our settings, the initial point should be over there, so why is it here? The reason is that what we stored is the result after the first update: -100, -100 is the true initial place, the first update jumped it over to here, and the stored history starts from the result of that first update.

Let me try again and increase the learning rate a little more, say I change the mantissa to 5.9, that is, 5.9e-2 in the same notation, and run. Let's see what it looks like: this time its strides are very large, it bounces back and forth like this, but it still seems to reach the lowest point. So what if we make it bigger? I change it to 1.0 times 10 to the power of -1 and execute, and you can see something terrible happened: the surface in our picture is gone. Why? The reason is that the path has gone far outside the range: when we drew the surface, we set the values of w and b between -100 and 100, and now it exceeds that by so much that the surface disappears. When the learning rate is set too high, each update may move not closer to the lowest point but farther away from it; in our example, every update landed farther from the lowest point, which leads to this situation where the entire surface disappears. The picture from before was exactly like this: with the learning rate set too high, it can get farther and farther from the lowest point, one step to here, the second step to there, and by the third step it doesn't even know where it's gone.

Here everyone can play around and see for themselves: set different initial values of w and b, and set different learning rates and other settings. This is our implementation of gradient descent.

Having completed simple linear regression, let me briefly summarize the process of machine learning. In our example, the first step was to prepare the data. Based on the distribution of that data, we decided a straight line could represent it, so we used a straight line to represent the data. Then we needed to find the line that best represents the data, that is, the line most suitable for it. But what kind of line counts as most suitable? We always have to give it a standard to judge by, so we set one: the smaller the sum of squared distances between the data points and the line, the more suitable we call the line for the data; conversely, the larger that sum, the less suitable. Once we had this scoring method, we couldn't possibly enumerate every line, score them all, and pick the best, that looks far too inefficient. We had to find the most suitable line in an efficient way, and here we used the gradient descent method.

That whole process is roughly the process of machine learning, and it's the same when you apply it to other examples. First of all, you need to prepare the data, and then, based on your data, set a model. In this example, the model we set was a straight line. After setting the model, it will contain some parameters you need to adjust; in this example, you need to adjust w and b. What counts as adjusting well, and what counts as adjusting badly? We always have to give it a standard to judge by, so we need to set a cost function; for this example, we set it as we did. Then, since it's impossible for us to exhaustively enumerate all parameter combinations, compute their costs, and find the best one, that's too inefficient, we must find the best parameters in an efficient way: that is, we need to set an optimizer. In this example, the optimizer used was gradient descent. That's roughly the simple machine learning process, and no matter what other example you use it on, it's really the same: you must first prepare the data, then set up a model, then set a cost function, and finally set an optimizer.

Okay, the second model I want to introduce to you is called multiple linear regression. It's actually similar to the simple linear regression we introduced before, but it can have many features.

So let's take a look at the problem we want to solve this time. In the earlier example of simple linear regression, we used seniority to predict salary, but think about it carefully: predicting salary based on seniority alone is a bit weird. We should consider more factors, so now we've collected more complete information: besides seniority, we also collected each person's education and where they work. Now we want to use seniority, education, and workplace together to predict salary; that is to say, our features now also include education and workplace.

If we still want to use a linear model to represent this data, we can use multiple linear regression. In mathematics it can be written as y equals W1 times X1 plus W2 times X2 plus W3 times X3, and so on, multiplying all the way depending on how many features you have, finally adding a b. In our example there are three features, which are seniority, education, and workplace, so the formula written out is long: the monthly salary we want to predict equals W1 multiplied by seniority, our first feature, plus W2 multiplied by the second feature, education, plus W3 multiplied by the third feature, workplace, and finally plus b. Our goal is to find a combination of W1, W2, W3, and b that best represents this data; that is what our multiple linear regression needs to do.
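The multiple regression formula above is just a dot product between a weight vector and a feature vector; here is a minimal sketch with entirely made-up weights and an already-encoded feature vector:

```python
import numpy as np

# Hypothetical weights: one per feature (seniority, education, city), plus b
w = np.array([9.0, 10.0, -2.0])
b = 20.0

# One person's features: 3.5 years seniority, education level 1, city code 0
features = np.array([3.5, 1.0, 0.0])

# y = w1*x1 + w2*x2 + w3*x3 + b, written as a dot product
salary = np.dot(w, features) + b
print(salary)  # 9*3.5 + 10*1 + (-2)*0 + 20 = 61.5
```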

Before we start looking for the most suitable w and b, I think everyone has probably spotted a problem. In our formula, we multiply W2 by education and W3 by workplace, but as we can see, both of those features are text. Multiplying text by a number is no good, so we must first process these two features and convert them from text to numbers before we can do the calculation.

Let's handle the education feature first. It has three possible values: high school and below, university, and master's and above. From this feature we can see that there is actually a high-versus-low relationship, and if there is such an ordering, we can use the numbers 0, 1, and 2 to represent the three cases: 0 represents the smallest, high school and below; 1 represents university; and 2 is master's and above. Whenever a feature has a size relationship, a high-low relationship, we can use this method to replace the text. This replacement method has a name: it is called label encoding, and after replacing the education values we just looked at, the column becomes numbers like 1 2 0.

Let's implement this label encoding. First, read the data in. The reading step is the same as before, so I directly use the copied code. However, the URL part changes, because our data this time has two more features: on the salary-data part of the URL, since it is the second version, we write a 2 at the end. Read it and display it to see: okay, our data looks like this, with three features in total, namely seniority, education, and workplace.

We want to deal with the education feature first, converting it from text to numbers. We can do the conversion like this: from the data, take the education feature, the column called EducationLevel, and display it first to see what it looks like. We want university to become 1, master's and above to become 2, and high school and below to become 0. Here we can write a map on the back of the column and put a dictionary inside it: we map high school and below to 0, then university to 1, and finally master's and above to 2. Written like this, it does the conversion for us, converting every value in the feature accordingly. After the conversion, I change the feature's values to the converted ones, and then we can display the entire dataset and see that it has become the corresponding numbers. The first record here was a university degree, the second master's and above, the third high school and below, so they become 1 2 0, and the same conversion continues all the way down. In this way, our label encoding is completed.
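The label-encoding step just described can be sketched with pandas' `map`; the column name follows the tutorial, but the DataFrame here is a tiny made-up stand-in (English value labels replace the dataset's actual ones):

```python
import pandas as pd

# Stand-in for the tutorial's salary dataset (education column only)
data = pd.DataFrame({"EducationLevel": [
    "University", "Master & above", "High school & below"]})

# Dictionary mapping each text value to its ordered number
mapping = {"High school & below": 0, "University": 1, "Master & above": 2}
data["EducationLevel"] = data["EducationLevel"].map(mapping)
print(data["EducationLevel"].tolist())  # [1, 2, 0]
```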

After dealing with the education feature, let's handle the workplace feature. This feature also has three possible values: city A, city B, and city C. Here you may wonder whether we can use the same method, label encoding, using 0, 1, 2 to represent cities A, B, and C. But if you think about it carefully, is that really okay? We don't know of any high-low relationship or size relationship among cities A, B, and C, so if we represent them with 0, 1, 2, who should be 0, who 1, and who 2?

If we want to convert a feature with no size relationship, we can instead convert it directly into multiple features, like this. We originally had only one feature, the city, and we change it into three features: city A, city B, and city C. The first record here originally worked in city A, so we give its city A attribute the value 1, and its city B and city C attributes the value 0. Look at the second record: it originally worked in city C, so we give its city C attribute the value 1 and the others 0, and so on. This method is called one hot encoding. Whenever we want to convert a feature that has no size or ordering relationship from text into numbers, we can use one hot encoding, which changes one feature into multiple features: however many possible values the feature originally had, that is how many features it becomes. In our example, its possible values are cities A, B, and C, so it becomes three features: city A, city B, and city C.

. After converting it like this,

in fact, we can also use one of these three features as Delete why ? The reason is that among these three features

, in fact, we only need to know the value of two of them

to deduce the value of the third feature

. Suppose we delete the feature of city c now. Let ’s see

if there is any way Deduce city c through cities A and B. It looks like this now .

If city a is 1 and b is 0,

then city c must be 0, because only one of these three features will be 1,

so let’s look at the second

If city a is 0 and city b is 0 , then city c must be 1,

because it is not city a, not city b, it must be city c

, and so on . Therefore, we have a way

to deduce the third through two

of the characteristics.

The value of a feature .

At that point, we can delete one of the features. Let's take a simpler example.

Suppose you have a feature called gender, with only two possible values: male or female. Clearly this feature has no size or ordering relationship, so we can use one-hot encoding to turn it into two features, male and female.

But a person is either male or female, so one of the features can be derived from the other. In that case we don't need to keep both features, because too many features make our calculations more complicated, so we can delete one of them.

Okay, back to the original example. Among the three features city A, B, and C, we only need the values of two to derive the third, so we can delete one of them; here we choose to delete the city C feature.

Let me remind everyone: not every derivable feature should be deleted. Some features, even if they can be derived, have special meaning or can speed up our calculations, and in those cases we keep them. But in the one-hot encoding example, we can delete one of the features after conversion.

Okay, then let's implement one-hot encoding directly.

Our data currently looks like this. We want to convert the city feature from text to numbers, so for this feature we can use one-hot encoding.

Here we can use preprocessing under sklearn, and import OneHotEncoder from it. The sklearn package provides many of the things we use when doing machine learning, such as the OneHotEncoder we're introducing now, which helps us convert quickly.

OneHotEncoder is a class, so first we create an instance — you can call it a converter, or say we create an encoder; here I call it the one-hot encoder. After creating it, we let this encoder read our feature, the city feature, so here we can call fit and let it see the city feature.

One important thing to note: it only accepts a two-dimensional matrix, not a one-dimensional one, so we can't write it with a single pair of square brackets, because that is one-dimensional. To make it two-dimensional we add a pair of square brackets, so with two pairs of square brackets it is a two-dimensional matrix, and it can read all the values of this feature.

Then we can transform it: we use transform, again passing in the feature we want to convert, and again it requires a two-dimensional matrix, so two pairs of square brackets. I'll call the converted result city encoded, and then display it directly.

After execution you can see the converted result looks like this. Why? Because by default, the OneHotEncoder returns a sparse matrix. It doesn't matter if you don't know what a sparse matrix is — what we want here is the complete matrix, so we can append toarray and it will return the complete matrix to us. Execute again, and you can see this is the result we want.

It takes the one feature with 3 possible cities A, B, and C and turns it into 3 features with 3 values.

Next we can replace the original city feature with the converted result. Our original data is data; execute it once and OK, the original data looks like this. If we want to add 3 more features for it — that is, 3 more columns — we can write it like this: with two pairs of square brackets, I add a cityA, then a cityB, and finally a cityC, and specify their values as the converted result we just computed. Display it, and you can see that writing it like this adds 3 more columns, the 3 features cityA, cityB, cityC.

Then we can delete the city feature, because it has been converted into these 3 features. And among these three features we can delete one of them — I'll delete cityC. We can write data.drop, and in the first parameter put the features we want to delete, represented as a list: the two features city and cityC. In the second parameter we specify the axis along which to delete. City and cityC are two columns, and if the thing we want to delete is a column, we need to specify axis as 1; if the thing we want to delete is a row, we specify axis as 0. So here we write 1, and then display the result.

Execute, and you can see the two columns city and cityC are now gone, so the work-location feature has also been processed.
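The whole one-hot workflow above can be sketched like this; the DataFrame contents and the column names are illustrative stand-ins for the tutorial's actual data file, not the real values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy data assumed for this sketch
data = pd.DataFrame({
    "city": ["a", "c", "b", "a"],
    "salary": [40, 55, 48, 42],
})

encoder = OneHotEncoder()
encoder.fit(data[["city"]])                   # fit needs a 2-D input
# transform returns a sparse matrix by default; toarray() gives the full one
city_encoded = encoder.transform(data[["city"]]).toarray()

# one feature with 3 possible values becomes 3 columns
data[["cityA", "cityB", "cityC"]] = city_encoded

# drop the original text column, and cityC (derivable from cityA and cityB)
data = data.drop(["city", "cityC"], axis=1)
print(data)
```

The categories are sorted alphabetically, so the first encoded column corresponds to "a", the second to "b", and the third to "c".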

After converting the text, the next step: when training a model, we usually divide the data into a training set and a test set. Training the model means finding the most suitable parameters — in our example, the most suitable w and b.

Usually when we train the model we don't use all the data, only part of it. What about the other part? The other part is for testing. Think about it: if we used all the data for training and found the best set of w and b, how would we verify how well that set of w and b works? We would want to test it — and we would test it on unfamiliar data. We would not use the training data for testing, because the machine has already seen the training data, meaning it already knows the answers, and testing on data whose answers it already knows would not be accurate.

So, to have unfamiliar data for testing, we usually divide the data into a training set and a test set. Like this: suppose we have 10 records. The training set usually accounts for about 70% to 80%. In this example we use 80%, so 8 records become the training set, and the remaining 20% — 2 records — become the test set. The training set is used to find the best w and b; the test set is used after we find the best w and b, to test them and see how they perform.

Now that we understand what training and test sets are, let's implement them directly.

OK, now our data looks like this. First, let's separate x from y. Everyone should remember that the model we want to use is multiple linear regression. Written as a mathematical formula, it is y = w1*x1 + w2*x2 + ..., with one term for each feature, and finally a b added at the end.

In our current example there are 4 features: seniority, education, cityA, and cityB, and the y we want to predict is the salary. So we first separate x and y: x is the four features — the first is seniority, the second is education, then cityA and cityB — and y is our salary. I display x and y to check: x is these 4 features, and y is the salary.

Then we can split them into a test set and a training set. Here we can again use sklearn, that very useful tool: from model selection under it we import train test split, and it will split the test and training sets for us.

In the first parameter we pass our x, the features, and in the second parameter we pass y. Then we can specify the size of the test set: here we write the test size. I want it to be 20%, so I set it equal to 0.2, and it will automatically take 20% of our data as the test set and the other 80% as the training set. If you want the test set to be 30%, write 0.3, and so on.

Written like this it returns 4 values to us: the x for training, the x for testing, the y for training, and the y for testing. So I'll call the first one x train, the second x test, then y train and y test — it returns these 4 values.

I display x train, the x we use for training — you can see it took out these records as the training set. Let's look at its length: it is 28. The original length of x is 36. We took 20% as the test set, so 80% as the training set: 36 times 0.8 is 28.8, and the 0.8 is automatically dropped, so 28 records become the training set. Let's look at the test set's length too: x has 36 records in total, and after splitting, 28 are the training set and 8 are the test set. The same goes for y.

When we display x train, you will notice one thing: x train seems to change every time we execute. Look at the first record: it is 5.1 0 1 0. Execute again and it changes to 6.9 2 1 0; execute again and it changes to 7.8 2 0 0. Why did it change? Because the splitting process is random by default, so each split gives a different result.

If you want the split result to be fixed, we can set another parameter called random state and give it a number. Each number corresponds to a different split. Suppose I give it 87: execute again and you can see this time's split looks like this, with the first record 4.6 1 1 0. Now, with random state fixed at 87, I execute again and you can see it does not change. But if I change the number — say to 86 — and execute again, it changes again; if I keep executing without changing 86, it stays fixed.

So if you want the split result fixed, give it a number. Here I specify 87, because I want the split to be fixed — that will be convenient for the demonstration later.

Okay, then I also display x test — it's these 8 records — then y train, OK, there are these, then y test. Done: we have successfully split the test set from the training set.

Finally, for the convenience of later calculations, I first convert x train and x test into numpy format. Right now they are both in pandas format, so they display nicely as a grid after execution. If I convert to numpy format by appending to numpy, they become plain matrices. Such a matrix looks uglier, but it makes my subsequent calculations more convenient, so I convert x train here, and x test the same way.
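The split described above can be sketched like this; the 36-row toy arrays below are made-up stand-ins mirroring the shape of the tutorial's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-ins: 36 rows with 4 features, and 36 salary values
x = np.arange(36 * 4).reshape(36, 4)
y = np.arange(36)

# test_size=0.2 -> 20% test set; random_state fixes the (otherwise random) split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=87
)

print(len(x_train), len(x_test))  # 28 8
```

With 36 rows and a 20% test fraction, the test set gets 8 rows and the training set the remaining 28, matching the count in the tutorial.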

After converting the text into numbers and splitting into training and test sets, we can return to the model. The model we want to use is multiple linear regression, so it can be written like this — but our features have gone from 3 to 4, since the work location became the two features cityA and cityB, so we have to rewrite it, and now it looks like this.

Our goal now is to find a combination of w1, w2, w3, w4 and b such that the predicted monthly salary here is as close as possible to the real data. Well, let's implement this part.

First I set the values of w and b. At the beginning I set w randomly; there are 4 of them, w1 through w4, so I represent them with a one-dimensional matrix holding 4 values. Here I first import numpy and call it np. Next I create w and let it be a matrix with 4 values — suppose I let it be 1 2 3 4, meaning our w1 through w4. Then we need to set a value for b. In the same way, I set it randomly and let it be 0.

After setting w and b, we want w1 to multiply the first feature, w2 the second, w3 the third, and w4 the fourth. Because we are in the training phase, the features we want to multiply are in x train. I display x train first — it looks like this: it has 4 columns in total, and each column represents a feature. So what we want is 1 times the first column, 2 times the second column, 3 times the third, and 4 times the fourth.

To multiply like this, we can simply write it directly: we multiply x train by w, and it will automatically multiply 1 by the first column, 2 by the second column, and so on. Let's execute: this is the result of the multiplication — the first column here is w1 times x1, the second column is w2 times x2, and so on.

Next, what we want is for them to add: w1*x1 + w2*x2 + w3*x3 + w4*x4, that is, we want to sum each row here. To sum each row, we can do this: first put the calculation result in brackets, then use .sum. If I only write it that way, it sums every value here; but what we want is to sum along each row, so I can set axis equal to 1 — equal to 1 sums in the horizontal direction. (If you wanted to sum in the vertical direction, you would set it to 0.)

Okay, let's execute: this is the summed result. Each value here is the sum of one of the rows just now, which is w1*x1 + w2*x2 + w3*x3 + w4*x4. After all the additions are done, we still need to add a b at the end. If I add b here, it will add b to every value — but b is currently set to 0, so the effect doesn't show. I set it to 1 and execute again: adding like this, every value here increases by 1.

So every value calculated here is our predicted salary, and I call it y pred. Next, we want to find the most suitable combination of w and b.
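The multiply–sum–add sequence above can be sketched with small made-up numbers (not the real training data):

```python
import numpy as np

# assumed toy training features: 3 rows, 4 columns (one column per feature)
x_train = np.array([
    [3.0, 1.0, 1.0, 0.0],
    [5.0, 2.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 0.0],
])

w = np.array([1, 2, 3, 4])  # w1..w4, set arbitrarily to start
b = 1

# broadcasting multiplies column i of x_train by w[i];
# axis=1 then sums each row -> w1*x1 + w2*x2 + w3*x3 + w4*x4, and we add b
y_pred = (x_train * w).sum(axis=1) + b
print(y_pred)  # [ 9. 14.  3.]
```

Each entry of y_pred is one row's predicted salary, exactly the per-row sum described in the transcript.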

To find the most suitable combination of w and b, we must first define what "most suitable" means — that is, we need to set a standard of judgment, which means setting a cost function. The cost function here can be set the same as in the earlier simple linear regression, because again we want the predicted monthly salary to be as close as possible to the real data. So I set it as the real data minus the predicted value, squared. The reason for squaring is that the subtraction may produce negative numbers; for convenience of calculation we square it so there are no negatives.

So what is our current goal? Subtract the predicted value from the real data and square it — the smaller that value, the better; in other words, the smaller the cost here, the better. Then let's implement the cost function directly.

Our cost function is set to the real data minus the predicted value, squared. The predicted value was computed in the previous step — the y pred here is the predicted value; let me display it first, this is our predicted value. As for the real data: because we are now in the training phase, the real data is y train, so we use y train minus y pred. I also display y train first to check — OK, there are all these records. We subtract y pred from y train, and after subtracting each one, we square it.

Execute, and you can see each value here: it is the result of subtracting the predicted salary from the real salary, then squared. We hope these values are as small as possible. To make them as small as possible, we can compute their sum or their average; here I compute the average — I want the average to be as small as possible. To compute it, enclose the expression in brackets and append mean at the end. Execute, and you can see this is the average result.

Here I write the cost calculation directly as a function, which makes it convenient to plug in different w and b. I call it compute cost. It takes x and y, our real data, and then the values of w and b. Inside it, the value of y pred is calculated first — our predicted value — but with x train changed to x. After calculating the predicted value, we can calculate the cost: the same calculation as before, with y train changed to y — the real data minus the predicted value, squared, then averaged. That is the cost; finally we return the cost.

Let's try it directly. We call compute cost, passing in x train and y train, because we are in the training phase now. Then for w and b I pass in these two values first. Execute and see — you can see the calculation result is the same as above. If I set b here to 0, and set this to 0 2 2 4, and execute again: the cost calculated for this combination of w and b is even higher than the one just now — the one just now was 1,772.
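A sketch of the compute cost function just described, using toy numbers assumed for illustration (the tutorial's real data gives different cost values):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Average squared difference between real y and the model's prediction."""
    y_pred = (x * w).sum(axis=1) + b
    cost = ((y - y_pred) ** 2).mean()
    return cost

# toy data: 2 rows, 4 features
x = np.array([[1.0, 0.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 1.0]])
y = np.array([10.0, 20.0])

print(compute_cost(x, y, w=np.array([1, 2, 2, 4]), b=0))  # 96.5
```

Different (w, b) combinations plug straight into the same function, which is exactly why the calculation is wrapped up this way.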

So that is our evaluation standard defined — the cost function. After setting the evaluation standard, we need an efficient way to find a set of w and b that makes the cost as low as possible. The efficient way is to set an optimizer. In our example, we can again use the gradient descent method. You should remember that it changes the parameters according to the slope.

However, when we used simple linear regression before, there were only two parameters, w and b. Now our parameters have become 5: w1, w2, w3, w4 and b. Let's look back at how we updated parameters before: to update w, we subtract from w the slope in the w direction times the learning rate; to update b, we subtract from b the slope in the b direction times the learning rate.

In fact, the way parameters are updated is the same — we have just gone from two parameters to five. Now we have the five parameters w1, w2, w3, w4 and b, and to update these five parameters we individually subtract from each one the slope in its own direction times the learning rate.

How do we calculate the slope in each direction? In fact, the same as before: we only need to differentiate the cost function to get the slope in each direction. Our cost function is the real data minus the predicted value, squared; written as a mathematical formula, it is y minus y pred, squared. We can expand y pred into w1*x1 + w2*x2 + w3*x3 + w4*x4 plus b at the end.

If we want to know the slope in the w1 direction, we differentiate with respect to w1 — or, to be precise, take the partial derivative with respect to w1. If what you want is the slope in the w2 direction, you take the partial derivative with respect to w2, and so on for w3, w4 and b. It's okay if you don't know any calculus here, because there are many tools that can do the differentiation for us automatically.

The slope in the w1 direction, after calculation, looks like this — you can see a very long string. But in fact we can notice that the string w1*x1 + w2*x2 + w3*x3 + w4*x4 plus b inside it is just y pred, so after simplification it becomes 2 times x1 times the result of (y pred minus y). Using the same method, the slope in the w2 direction comes out like this — we can see it just replaces x1 with x2, and everything else is the same. From this we can see the slope in the w3 direction replaces it with x3, and the slope in the w4 direction replaces it with x4. The slope in the b direction, once calculated, is similar, except there is no x to multiply.

Here we can see that whether it's the w direction or the b direction, there is a leading multiplication by 2. I don't know if you still remember — we said before that this multiplication by 2 can be omitted, because when we update the parameters we multiply by a learning rate anyway, so the factor of 2 can be left to the learning rate. Omitting it everywhere, the slopes become like this.
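In symbols, the derivation above for a single data point (the implementation then averages over all training rows) is:

```latex
\begin{aligned}
\text{cost} &= (y - y_{\text{pred}})^2,\qquad
y_{\text{pred}} = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b \\
\frac{\partial\,\text{cost}}{\partial w_i} &= 2\,x_i\,(y_{\text{pred}} - y)
\qquad (i = 1,\dots,4) \\
\frac{\partial\,\text{cost}}{\partial b} &= 2\,(y_{\text{pred}} - y)
\end{aligned}
```

Dropping the constant factor 2 only rescales the gradient, which the learning rate absorbs.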

So now we know how to calculate the slopes. Back to where we were: we have the slopes in all directions, and each will be multiplied by a learning rate. How do we set the learning rate? Just as before, through testing and experimentation. You can't set it too large, because it may jump straight past the lowest point, or get farther and farther away from it; and you can't set it too small, because then it may never reach the lowest point.

After setting the learning rate and calculating the slopes, we only need to keep updating the parameters, letting them approach the lowest point step by step, and then we can find the most suitable combination of w and b we want. So let's implement gradient descent directly.

First, we calculate the slope in each direction. The slope in the b direction is relatively simple, so I calculate it first; I call it b gradient, and it is y pred minus y. It is the training phase now, so our y is y train. This y pred was calculated before, so I just copy it — but since we are now in the training phase, I change the x here to x train. I display the calculated result: because our training set has 28 records, there are 28 values here. So let's take an average: put it in brackets and then use .mean to calculate the average. The average is -46.94.

Having calculated the slope in the b direction, let's calculate the slope in the w1 direction.

I call it w1 gradient: it is (y pred minus y train) multiplied by x1. This x1 is actually our first feature, and we are in the training phase now, so our features are in x train. I display x train first: it shows a total of four columns, and these four columns are our four features. Here, x1 is the first feature, which is the first column here.

If we want to get the first column, we can write it like this: a pair of square brackets and then :, 0. The : means we want all the values in the first dimension, and the 0 after the comma refers to the second dimension — we only want the very first value. You can see this matrix is a two-dimensional matrix: we want all the values in the first dimension, that is, everything inside the first pair of square brackets; and from the second dimension — each row here — we only need the first value, which gives us the first column. Execute it, and you can see it takes out all the values of the first column, which is our x1, so I replace it here.

Then I display the calculation result: it calculates 28 values in total, because we now have 28 records in the training set. So let's take its average: enclose it in brackets, take the average, and execute. That gives the slope in the w1 direction: -295.8.

The calculation method for w2, w3 and w4 is exactly the same, so here I just use the copy: change the 1 to 2, 3, 4, and the column index here from 0 to 1, then 2, 3. Display the execution results — OK, you can see these are the slopes in the directions of w1, w2, w3 and w4.

Here it's more convenient to write this with a loop, so I create a w gradient, which stores the slopes in the w1-through-w4 directions. First I create a matrix of all 0s with 4 values in it; I display it first — there are 4 values in it, all 0 for now. But it seems not great to write a literal 4 here. The meaning of the 4 is how many features we have — we have one w per feature. If we want to know how many features we have, we can get it directly from x train: to know how many columns x train has, we can get its shape, and it will tell you it has a total of 28 rows and 4 columns — that is, 28 values in the first dimension and 4 values in the second dimension. The value of the second dimension is what we want, so here I can get the 4 with square brackets [1]. So it will be better if I change it to this.

Then I use a loop to calculate the slopes: it runs four times in total, so I paste the calculation here, and the i-th element of w gradient is the same calculation as here, but with the index changed to i. That should be fine. I also display the result after the calculation: you can see the slopes in the four directions w1 through w4 are these four values.

Well, with that we have calculated the slopes in the w and b directions — this is the w direction and this is the b direction.

Next, I write the slope calculation as a function so it can be used later. I call it compute gradient. It takes x and y, our real data, and then the values of w and b. Inside, we first calculate y pred — it has already been calculated, so I paste it in directly. After the y pred calculation, I create a blank matrix — a matrix of all 0s — and change this side to x, then use a for loop to calculate the slope for each w, also changing these to x. After that we still need to calculate the slope in the b direction; the b-direction slope is also pasted here, changed to y as well. After the calculation, we return w gradient and b gradient. OK, execute.

Then we use it directly. The x to pass in — it is the training phase now, so x and y are x train and y train — and for w and b I pass in these values. Let's execute and see: the result is exactly the same as above, for both w and b. Let's make a change: say I change b to 1 and change this to 1 2 2 4 and execute again. Then the calculated slopes are different.

. Then the calculated slope will be different.

It’s easy to calculate. After the slope,

then we try to update the

parameters. To update w, we need to subtract the slope in the direction of w from w

, and then multiply it by a learning rate

. It is the same when updating b

. Just change this side to b

and then the learning rate here.

Let me set it casually. Suppose I let it be 0.001 as

the initial value of w and b. I also let it be like this.

The slope in the direction of w and b. We will directly call this function to do the calculation,

and it will send the slope back to us.

In the direction of w and b,

here I will display the updated results of w and b to

see if there is any difference. You can

see that it was originally 1 2 2 4

and then it becomes 1.2 more than 2.0 more than 2.0 more than 4.0

and then b There

are also changes. Let's see if its cost

has really decreased after such an update

, that is, has it really gone down?

Here, I will directly use the function here to calculate the cost.

Calculate it before it is updated

, and then I will print out the result

. After the update, I will display it again

to see if it is really smaller

. You can see that it is really smaller. It was

originally It is more than 1,800 and then becomes 1,675.

After confirming that it gets smaller, we need to repeat these steps so it keeps updating the values of w and b. We write this as a function called gradient descent. In fact, we already wrote this function in the earlier simple linear regression, and the writing is basically the same, so I copy it directly and we can execute it.

After executing, we call it directly. We have called it before, so I copy that too, but a few places need changing: the initial values of w and b here I change to this; the learning rate I set to 0.001, so this side can also be written like that; and I just let it run 10,000 times first. As for this part — because we have now split into a test set and a training set, and we are in the training phase, I write x train and y train. The remaining places are basically the same and don't need changing: the names compute gradient and compute cost are the same. Well, I execute it, and you can see it raises an error.

The reason for the error is here: w and w gradient are in numpy array format, and with a numpy array there is no way to directly apply the :.2e format conversion here. So let me delete that first and execute again — there should be no problem now, and you can see there isn't. You can see the cost is decreasing, and the values of w and b are both being updated here.

But without that formatting it looks ugly and uneven. These are the slopes in the w direction and the b direction — w has w1 through w4, so there are 4 values here, and likewise 4 values there — and right now the output is ugly.

If we want to make it prettier, we can directly set the format in which numpy arrays are printed. We can call np.set_printoptions to set its formatter, which is a dictionary: I want floating-point numbers in this format — a colon, a space, and .2e written at the end. Setting it like this is equivalent to writing the format that way for every value: it will print every value in the matrix in this format.

Then let's execute again, and you can see it has become much prettier. Looking at the cost, it has been decreasing the whole time — no problem. After running 10,000 times, we can see the cost seems to be still decreasing.
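The print-format setting just described can be reproduced in isolation like this (the sample numbers are arbitrary):

```python
import numpy as np

# print every float in a numpy array in scientific notation with 2 decimals
np.set_printoptions(formatter={"float": "{: .2e}".format})

print(np.array([12345.678, -0.0042]))  # [ 1.23e+04 -4.20e-03]
```

This only changes how arrays are displayed, not the stored values, so it is safe to set once at the top of a notebook.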

Here I'll test giving it a slightly higher learning rate — well, it seems the cost is still declining, and clearly faster than before, so this learning rate should be fine. Gradually, though, the rate of decline seems to slow. I'll leave the settings like this and just let it run 10,000 times.

Here everyone can go and play with it themselves: set different initial w, different initial b, different learning rates, and different numbers of iterations, and see what kinds of results you get.
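Putting the pieces together, the training loop described above might look like this sketch; the helper functions repeat earlier assumptions and the toy data stands in for the tutorial's real x train and y train, so the cost values will differ:

```python
import numpy as np

def compute_cost(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    return ((y - y_pred) ** 2).mean()

def compute_gradient(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    w_gradient = np.array([((y_pred - y) * x[:, i]).mean()
                           for i in range(x.shape[1])])
    b_gradient = (y_pred - y).mean()
    return w_gradient, b_gradient

def gradient_descent(x, y, w, b, learning_rate, run_iter):
    # repeatedly step each parameter down its slope
    for _ in range(run_iter):
        w_gradient, b_gradient = compute_gradient(x, y, w, b)
        w = w - w_gradient * learning_rate
        b = b - b_gradient * learning_rate
    return w, b

# toy stand-ins for x_train / y_train
x_train = np.array([[1.0, 0.0, 1.0, 0.0],
                    [2.0, 1.0, 0.0, 1.0],
                    [3.0, 1.0, 1.0, 0.0]])
y_train = np.array([10.0, 20.0, 25.0])

w_final, b_final = gradient_descent(
    x_train, y_train, w=np.zeros(4), b=0.0,
    learning_rate=0.001, run_iter=10000
)
print(compute_cost(x_train, y_train, w_final, b_final))
```

The tutorial's version also prints the cost and parameters every so many iterations; that logging is omitted here to keep the sketch short.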

At the end I'll find the final w and b and display them. You can see the w and b it ultimately found are these 5 values (four weights plus the bias). Now I want to verify whether these 5 values are actually good.

We can check them on the test set. We take the final w and multiply it by x_test; this multiplication gives w1*x1, w2*x2, w3*x3, and w4*x4, and we then need to add those together, so I wrap the product in parentheses and sum along axis 1. Finally we add the b we found. The result of this calculation is our prediction on the test set; I call it y_pred, and we can compare how far the predictions differ from the real values.
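That prediction step can be sketched like this; the parameter values and test rows are hypothetical stand-ins for the tutorial's w_final, b_final, and x_test:

```python
import numpy as np

# Hypothetical final parameters (4 weights and a bias) and two test rows.
w_final = np.array([9.0, 4.0, -2.0, 1.5])
b_final = 30.0
x_test = np.array([
    [1.2, 2.0, 1.0, 0.0],
    [4.1, 0.0, 0.0, 1.0],
])

# Row-wise w1*x1 + w2*x2 + w3*x3 + w4*x4, then add b.
y_pred = (w_final * x_test).sum(axis=1) + b_final
```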

Here I display it as a pandas DataFrame, which looks better because it comes with a grid. I give it two columns: the first column is our prediction y_pred on the test set, and the second is the real test-set data y_test, so we can see how much they differ. When I execute it there is an error here, because I typed an extra "u"; after fixing it, we execute again.
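A small sketch of that side-by-side comparison, using made-up numbers that echo the ones read off in the video:

```python
import numpy as np
import pandas as pd

# Hypothetical predictions vs. ground truth from the test set.
y_pred = np.array([40.1, 67.7, 61.6])
y_test = np.array([43.8, 72.7, 60.0])

# Two labelled columns make the errors easy to eyeball.
comparison = pd.DataFrame({"y_pred": y_pred, "y_test": y_test})
print(comparison)
```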

You can see that the left column is our prediction on the test set and the right column is the real test-set data. We predict a bit over 40 against a real value of 43.8, which doesn't seem far off; 67.7 versus 72.7 is a bit worse; 61.6 versus 60 is fine. Overall the differences feel acceptable. So we have successfully implemented gradient descent, found the final w and b, and this is the result of using that final set of w and b on the test set.

Finally, if we want to know more precisely whether these predictions are good, then besides eyeballing the errors we can also calculate something: since the value of the cost is our criterion for judging good or bad, we can compute the cost here. We call compute_cost directly; now we are on the test set, so x is x_test and y is y_test, while w and b are the w_final and b_final we found. Calculating it, the cost is a bit over 18.1. Let's compare that with the cost during the earlier training: at the end of training the cost had dropped to 2.52*10^1, a bit over 25.2. That seems fine, because our cost on the test set is smaller than on the training set, which suggests the model performs even better there than on the training data. So after computing the cost and making this comparison, if the error is acceptable to you, we can apply the model to a real situation.
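A minimal sketch of such a cost check, assuming the tutorial's compute_cost is the mean squared error; the data and parameters below are invented stand-ins for x_test, y_test, w_final, and b_final:

```python
import numpy as np

# Mean squared error between predictions and labels, as the tutorial's
# cost function is described.
def compute_cost(x, y, w, b):
    y_pred = (w * x).sum(axis=1) + b
    return ((y - y_pred) ** 2).mean()

# Hypothetical stand-ins for the test data and final parameters.
x_test = np.array([[1.0, 2.0, 1.0, 0.0],
                   [3.0, 0.0, 0.0, 1.0]])
y_test = np.array([45.0, 60.0])
cost = compute_cost(x_test, y_test, np.array([9.0, 4.0, -2.0, 1.5]), 30.0)
```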

Suppose someone actually comes for an interview today. He tells you his seniority is 5.3 years and his degree is a master's or above, and he interviews at the city A office. You want to hire him but don't know what salary to offer; at this point we can use the trained model to predict what salary we could give him. First we must process the data, because there is text here that needs to be converted into numbers, in exactly the same way we converted the test-set data. I'll convert it directly: seniority 5.3 is numeric, so we use it as-is; master's degree or above becomes 2 through our label encoding; and for the work city we used one-hot encoding to create 3 features and then deleted one, leaving 2 features, so for city A the third feature is 1 and the fourth feature is 0.

I represent this with an array I call x_real, equal to np.array of those values. And since we did feature scaling, the same applies in the real situation: we must scale these features too. How was our feature scaling done? Let me scroll up and find it; here is how our test set was scaled, so I'll copy that and process x_real the same way. Displaying the scaled result, it complains: it says only two-dimensional arrays are accepted, not one-dimensional ones. Right, I forgot to reshape it; the scaler only accepts two-dimensional matrices. So I add an extra pair of square brackets to make it two-dimensional, and it executes. This is the result after scaling, and now we can feed it into the model.
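A sketch of that step. The encodings follow the tutorial's description (seniority, label-encoded education, two one-hot city columns), but the training rows here are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training features (seniority, education, cityA, cityB),
# standing in for the tutorial's x_train.
x_train = np.array([
    [1.0, 0, 1, 0],
    [5.0, 1, 0, 1],
    [9.0, 2, 0, 0],
])
scaler = StandardScaler().fit(x_train)

# One new applicant: 5.3 years, master's or above (label-encoded 2), city A.
# StandardScaler.transform expects a 2-D array, so the single row is
# wrapped in an extra pair of brackets.
x_real = np.array([[5.3, 2, 1, 0]])
x_real_scaled = scaler.transform(x_real)
```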

This is how our model computes its prediction, so I just copy that formula down here. What we want to apply it to now is x_real, so I change the variable accordingly. I call the result y_real and display it, and you can see the final model's prediction: it tells us we can offer this person a salary of about 6.55*10^1, that is, 65.5K.

If a second person comes for an interview today and tells you his work experience is 7.2 years, his degree is below high school, and his workplace is city B, and we want to predict his salary, I just add his information as another row: seniority is 7.2, below high school converts to 0, and city B converts to 0, 1. That's the second person. We run the same prediction, and you can see the model's answer: we can offer this person a salary of about 23.4K. This is how to apply the model to a real situation.

We already implemented gradient descent in the previous step. In fact, in our example we can accelerate gradient descent: we only need a small technique called feature scaling to achieve this speed-up.

Let's take a look. The current example has four features, namely seniority, education, and the workplaces cityA and cityB, and we use multiple linear regression to predict the monthly salary, so it is written as w1*x1 + w2*x2 + w3*x3 + w4*x4 + b. Looking at the values of these four features, we find their distributions differ: the first feature ranges roughly from 1 to 10, the second from 0 to 2, and the third and fourth are either 0 or 1.

We can clearly see that the range of feature x1 is larger than that of the other three. Going back to the formula, w1 is multiplied by a relatively large value while w2, w3, and w4 are multiplied by relatively small values, and this makes our gradient descent slower. Why? Because w1 is multiplied by a relatively large value, even a slight change in w1 greatly affects the computed prediction, and that in turn affects the computed cost. In other words, as long as w1 changes slightly, the cost changes a lot.

So if we plot the cost against w1 and w2, it will look roughly like this. Here I use a contour map: the center point is where the cost is lowest, and the cost rises toward the outer rings; the x-axis is w1 and the y-axis is w2. We can see this contour map is long and narrow. Why? Because w1 is multiplied by a relatively large value, a small change in w1 changes the cost a great deal, whereas a change in w2 barely affects the cost. Now let's see what happens if we run gradient descent in this situation. Suppose our initial point, the initial w1 and w2, is here. As we update the parameters, something like this is very likely to happen: the path oscillates back and forth. Why? Because, as we said, a slight change in w1 has a big impact on the cost, so when updating w1 it is easy to accidentally overshoot: we jump past the minimum to the other side, then overshoot again on the way back, producing this back-and-forth oscillation. That makes reaching the lowest point very slow; in other words, our gradient descent becomes very slow.

So how do we solve this? It's very simple. The problem is caused by the features having different ranges: the first feature's range is relatively large and the others are relatively small. To fix it, we bring the features to the same range; I shrink the large one here so that all four features fall in the same range. After that, the contour map looks like this, and gradient descent now proceeds very smoothly, heading straight to the lowest point. So to speed up gradient descent, we only need to scale every feature to the same range.

There are many ways to do feature scaling; here I'll introduce a very classic and commonly used one, called standardization. The method is to take each feature, subtract that feature's average, and divide by that feature's standard deviation. For our seniority feature, that means subtracting the average seniority from each seniority value and dividing by the standard deviation of seniority. It doesn't matter if you don't know what a standard deviation is, because many tools can compute it for us automatically. If we standardize every feature this way, it looks like this: you can see their ranges become very close.
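The standardization just described, as a small sketch (the seniority values are invented):

```python
import numpy as np

# Standardization: subtract the feature's mean, divide by its standard
# deviation. Afterwards the feature has mean 0 and standard deviation 1.
seniority = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
seniority_scaled = (seniority - seniority.mean()) / seniority.std()
```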

You may wonder about the last three features: aren't they inherently small already, so why scale them too? It does no harm, and it keeps all the features on the same footing. Let's implement feature scaling directly.

If we want to do feature scaling, we do it after splitting the data into training and test sets. We split them here, so the scaling goes after the split. The method we'll use is standardization; to use it, we can import StandardScaler from sklearn.preprocessing. Because it is a class, I first create an instance, which I call scaler. After creating it, we let it read the training set's feature data. Note that it is only allowed to see the training-set features, never the test-set features, because the test data may only be used at testing time. So we write scaler.fit and pass in x_train.

After we pass that in, it computes the mean, standard deviation, and so on of those features. Once that is done, we can do the conversion: we write scaler.transform, pass in x_train the same way, and it performs the transformation. We replace the original x_train with the transformed result and display it; this is the data after conversion. Here I also transform x_test right away. We just do the conversion now without using it, which will be more convenient when we test later, and we likewise replace x_test with the converted result. Pay special attention here: we do not compute a standard deviation or mean from the test-set data and then convert. We directly reuse what was fitted on the training set, that is, the standard deviation and mean computed from the training data, and apply them to the test set for the conversion. Displaying it, this is the converted test set.
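The fit-on-train, transform-both pattern just described, sketched with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the already-split feature matrices.
x_train = np.array([[1.0, 0.0], [5.0, 1.0], [9.0, 2.0]])
x_test = np.array([[3.0, 1.0]])

scaler = StandardScaler()
scaler.fit(x_train)                 # statistics come from the training set only
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)   # reuse the training mean/std on the test set
```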

Now let's compare directly: with feature scaling applied, how does gradient descent differ from gradient descent without it? The gradient descent we did before is here, with its result; I copy it and execute it again here. All the parameters are the same; the only difference is that x_train has now undergone feature scaling. Running it again under the same conditions, is it really faster? You can see the cost reaches about 2.5*10^1 after roughly 1,000 updates, and over the following 2,000 to 3,000 updates the cost basically stops moving, so we can guess it has nearly reached the lowest point. Comparing with the earlier result, that run needed about 4,000 updates just to reach 2.5*10^1. So we can clearly see that after feature scaling, gradient descent really does descend faster.

Next, let me introduce a model commonly used for classification problems, called logistic regression. Let's first understand the problem we want to solve. Suppose whether a person has diabetes may be related to their age, weight, blood sugar, and gender, so we collect some data. Each record here represents one person's age, weight, blood sugar, and gender, plus whether they have diabetes: 1 means diabetic, 0 means not. We want to use these 4 features to predict whether a person has diabetes. The output has only two possibilities, 1 or 0, diabetic or not, which is very different from our previous two examples: in simple and multiple linear regression we predicted a monthly salary, and the output could take infinitely many values. Here there are only two possible outputs. When the possible outputs are limited like this,

we say it is a classification problem. A quick disclaimer about the data: these records were randomly generated by me, so if you have a medical background, don't take them too seriously. You might also object that above a certain blood sugar level a person simply has diabetes; I don't know, so let's just treat the data as given for now.

Let's look at it with a graph. This is the earlier linear regression example, where we used seniority to predict salary. The features could actually include not only seniority but also workplace, education, and so on; here, for ease of presentation, I show only seniority. We said we can represent that data with a line. Now back to today's example: we want to use a person's blood sugar to predict whether they have diabetes. There could likewise be many features, such as the weight, age, and gender just mentioned, but for ease of graphing I draw only the blood sugar feature. We can see the distribution of the data here is very different from the plot on the left, because here the output has only two possibilities, 1 or 0, diabetic or not, so it looks like this. Would you still use a straight line to represent this data? That seems wrong, right? So what should we do? Actually we don't need to change much: just bend the line a little so it looks like this. Doesn't that seem to represent the data? The value of this curved line can only lie between 0 and 1, which matches our data very well, because our data is either 1 or 0.

The question now becomes how to bend it. It is very simple: we only need a function called the sigmoid function, also known as the S-shaped function, to achieve the bending effect. Taking our earlier linear regression example, the formula can be written as y = w*x + b, or, if you have many features, y = w1*x1 + w2*x2 + ... on for however many features you have. To bend such a linear model, we just substitute it into the sigmoid function. The sigmoid function looks like this: 1/(1 + e^(-z)). It sounds like a tongue twister, but we simply take the linear model, put a minus sign in front of it, and place it in the exponent to make the bend. You may wonder what the letter e means here: e is just a constant, 2.7182818..., like pi in mathematics, nothing more.

Putting the linear model in the exponent with a minus sign is what achieves the bending. Before substituting it in, I first rename the y of the linear model to z. After substituting, our new predicted value y becomes this: the whole linear expression, with as many terms as you have features, goes into the exponent. This is our logistic regression model.
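Written out with the same symbols, the model just described is:

```latex
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b, \qquad
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
```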

Don't be intimidated by how it looks; in practice there are many tools that can do the calculation for us automatically. Back to the problem we want to solve: we want to use age, weight, blood sugar, and gender to predict diabetes, so we have these 4 features. In other words, we want to find the best combination of w1, w2, w3, w4, and b which, substituted into the sigmoid function and bent, best represents our data. Now let's implement it directly.

First, as before, we read in the data. The reading step is the same as before, so I just copied it, but the data's URL is different here, so I changed it; you can find the URL in the course files. Executing it, we can see our data: rows 0 through 399, so 400 records in total.

After reading it in, we first process the data. There is text here, so we need to convert the text into numbers; here I convert the gender column. We grab the gender feature and convert it with map, mapping male to 1 and female to 0. Displaying it after the conversion, you can see it is complete.
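That conversion, sketched with a made-up column (the tutorial's actual column name and label spellings may differ):

```python
import pandas as pd

# Hypothetical gender column; map the text labels to numbers.
data = pd.DataFrame({"gender": ["male", "female", "male"]})
data["gender"] = data["gender"].map({"male": 1, "female": 0})
```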

Then we split the data into training and test sets. The splitting method is the same as before, so I just copy it, but the x and y need to change: our features are age, weight, blood sugar, and gender, and y is whether the person has diabetes. Once that's done, I display x_train and x_test; this is the result of the split. We also display the y part, y_train and y_test: no problem. Here we again use 0.2 for the test set, and since we have 400 records in total, that gives 80 records for the test set and 320 for the training set.

Then, because our 4 features have different distribution ranges, if we want gradient descent to run faster later we need to do feature scaling. The feature scaling part is the same as before, so I copy it directly. We display the scaled results: this is x_train after scaling, and I display x_test too; no problem. Once the data processing is complete, we can feed it into the model. First I set w and b to arbitrary values: I import numpy and make w an array with 4 values, because we have 4 features, setting w1 through w4 to 1, 2, 3, 4, and I set b to 1. Then our model goes like this: first multiply w by x; we are in the training stage, so I multiply by x_train and display the result. This is the product. After multiplying we want to sum it, the same as before, so we use sum along axis 1 and execute; this is the sum. Then we add b, and this whole expression would be the predicted y in our earlier multiple linear regression. But now we are doing logistic regression, so I call it z, and next we want to put this z into the sigmoid function.

First I'll write the sigmoid function. I call it sigmoid, and it takes in a z. The calculation is 1 divided by 1 plus e raised to some power; for e we can write np.exp, with the desired exponent inside the parentheses, and what we want is the power of -z. We return 1 over that denominator, and the denominator needs to be wrapped in parentheses. Then I import numpy here and execute.

Then we use it directly: we pass our z in and take a look. After the conversion you can see that, through this sigmoid function, every value now lies between 0 and 1. So we have successfully transformed the linear model: what was originally a multiple linear regression has been bent. What comes next is the same as before: we just want to find the best combination of w and b so that it best represents our data.
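The pieces described above, gathered into one runnable sketch (w, b, and the x_train rows are invented stand-ins for the tutorial's variables):

```python
import numpy as np

# The sigmoid function: 1 / (1 + e^(-z)), squashing any number into (0, 1).
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters and training features (4 features per sample).
w = np.array([1, 2, 3, 4])
b = 1
x_train = np.array([[0.5, -1.0, 0.2, 0.0],
                    [1.0, 0.0, -0.5, 1.0]])

z = (w * x_train).sum(axis=1) + b  # linear part, as in multiple regression
y_pred = sigmoid(z)                # bent: every value now lies in (0, 1)
```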
