【Machine Learning】3-Hour Beginner Tutorial | Artificial Intelligence (AI) | Python | Intro to Machine Learning #AI #ML #DeepLearning
By GrandmaCan -我阿嬤都會
Summary
## Key takeaways

- **AI Enables Human-Like Machine Intelligence**: Artificial intelligence is the goal of making machines as intelligent as humans so they can handle complicated tasks. Machine learning achieves this by finding rules in past data, while deep learning uses neural networks that imitate the human brain to extract those rules. [00:40], [01:19]
- **Machines Learn Like Children from Examples**: Machines learn when we input data, such as pictures of cats and dogs, into a model that initially guesses randomly and corrects itself based on its errors, much like teaching a child through repeated examples until it answers accurately. After training, the model can classify new, unseen images correctly. [04:12], [07:49]
- **Predict House Prices from Multiple Features**: Use floor area (in pings), location, and number of rooms as features to train a model that predicts the selling price as the label, correcting its guesses against real data until the errors are small. For a 72-ping house in Washington with 6 rooms, the model estimates 7.3 million. [07:52], [11:15]
- **Linear Regression Fits Salary to Seniority**: Simple linear regression represents salary data as y = w*x + b, where x is seniority and y is salary, finding the optimal w and b that minimize the squared errors. For 2.5 years of seniority, it predicts a salary of about 51K. [16:05], [17:53]
- **Cost Function Measures Model Fit**: The cost function computes the average squared difference between predicted and actual values, forming a parabola whose minimum point indicates the optimal parameters. Gradient descent finds this minimum efficiently by updating the parameters along the slope. [34:06], [39:44]
- **Gradient Descent Optimizes with Learning Rate**: Gradient descent updates parameters by subtracting the slope times a learning rate, converging faster with a moderate rate; too high a rate oscillates past the minimum, while too low a rate makes progress indefinitely slow. After enough iterations it finds parameters yielding a low cost such as 32.69. [59:19], [01:07:12]
Topics Covered
- Machine Learning: Finding Rules in Data
- Teaching a Child vs. Training a Machine: The Same Learning Process
- Simple Linear Regression: Predicting Salary from Experience
- Testing Gradient Descent: Comparing Predictions to Real Data
- Feature Scaling Solves Gradient Descent Oscillation
Full Transcript
Hello everyone, I'm Xiaobai. This is a machine learning course that combines theory with practice, and we'll use Python for the hands-on part. So if you haven't learned Python yet, you can take my Python course first. As for the theory: this isn't a math class, so there won't be any difficult mathematics in it — don't worry. Alright, no more small talk, let's get started.
Hello everyone, I'm Xiaobai. Welcome to this course. First up: artificial intelligence. AI is short for artificial intelligence — you've probably heard the term often. So what is artificial intelligence? In one simple sentence: we want machines to have the same intelligence as humans. Once a machine has intelligence, it can help us handle many complicated tasks.

Then what is machine learning, ML? How do we give a machine intelligence? One way is to let the machine learn. Now the question is: how does a machine learn? We can first think about how humans learn — humans learn from their past history and past experience. The same goes for machines: a machine's past history and experience are simply the data it has stored up. So machine learning, in one simple sentence, is finding rules in previously stored data. OK: machine learning, in one sentence, is finding rules in data.
Then what is deep learning, DL? Machine learning is finding rules in data, and there are many ways to find rules in data. One of those methods is deep learning: deep learning finds the rules by imitating the neural networks of the human brain. So what is a neural network? I'll introduce that later in the course.
Okay, let's summarize artificial intelligence, machine learning, and deep learning with this picture. Artificial intelligence is the ultimate goal we want to reach: we want machines to be as intelligent as humans, so they can help us handle many complicated tasks. How do we achieve artificial intelligence? One way is to let machines learn. Machine learning, in one simple sentence, is finding rules in data. There are many ways to find rules in data, and one of the most powerful is deep learning, which imitates the brain's neural networks to extract the rules from the data. OK, so that's the relationship between artificial intelligence, machine learning, and deep learning. Next, let's look at how a machine actually learns.
We said before that machine learning means finding rules in data. So let's go a step further: how does a machine, or a computer, find rules in data? In fact, it's achieved with some mathematical techniques and programs. Let's look at what the machine learning process looks like. Suppose we have some picture data in hand — pictures of cats and dogs — and we want the machine to learn from these pictures how to tell cats and dogs apart. In other words, we want the machine to find the rules for distinguishing cats from dogs in this data. How can we do that? First, think about how a human learns to distinguish cats from dogs.
Suppose you want to teach a child who doesn't yet know what a cat or a dog is; you have to teach him how to tell them apart. At the start, you show him a picture and ask: is this a cat or a dog? You can see the child just staring blankly — he has no idea what you're talking about and would rather go to sleep — so he answers at random: "It's a dog." Obviously wrong. As the teacher, you mark it with a cross and tell him: an animal that looks like this is a cat. Then you show another picture and ask again: cat or dog? This time his eyes show a little more confidence than last time, and he answers: "It's a cat." Unfortunately, wrong again. You mark it with a cross again and tell him: an animal that looks like this is a dog. You keep showing photos like this and having the child answer, and whenever he answers wrong, you have him correct himself. Over time, having seen so many pictures of cats and dogs, he roughly knows what a cat looks like and what a dog looks like. Now when you ask him whether a picture is a cat or a dog, he'll tell you very firmly and confidently: "This is a dog."

The same learning process can be applied to a machine. We can input a picture into the machine — into the computer — and ask: is this a cat or a dog? The input triggers some programs, and behind those programs some mathematical techniques are actually at work. The combination of mathematical techniques and programs is what we can call a model. So we can say: we input a picture into the model and ask it whether the picture is a cat or a dog. At the start, the model is just like that ignorant child — it doesn't know what a cat or a dog is, so it guesses randomly: "It's a dog." Obviously wrong, so at this point we mark it with a cross and tell the model: one that looks like this is a cat — please correct yourself. Then we input another picture into the model and ask again: cat or dog? This time it answers "It's a cat." Obviously wrong again; we mark it with a cross and tell it: one that looks like this is a dog — please correct yourself. Through continuous training like this — constantly showing the machine lots of pictures of cats and dogs, and having it correct itself whenever it's wrong — the model eventually reaches a certain accuracy: input a picture of a dog, and the model tells you it's a dog. After the machine — the model — is trained, suppose you get a new picture in the future and want to know whether it's a cat or a dog: you can input it into the model, and it will tell you this is a dog.
Next, let's look at an example. Suppose we have some data on house sales: the floor area of each house in pings (坪, a Taiwanese unit of area) and its corresponding selling price. We want the machine to find the rule in this data — the relationship between the floor area and the selling price. In other words, we want to guess what the selling price should be from the floor area. In this setup, the floor area is called the feature, and the price is called the label: we guess the price from the area, so the area is the feature and the price is the label.

The training process is similar to before. We input the feature — the floor area — into the model. The model is ignorant at the start, so it guesses randomly; suppose it guesses 4 million. Looking at the data in hand, this 50-ping house sold for 5 million — our label is 5 million — which is obviously far from its guess. So we tell the model: this house sells for 5 million, please correct yourself. Next we input the next data point, 66 pings; the model guesses 9 million, but looking at the data in hand the label is 6.5 million, so we correct it again. We keep revising like this, going through the data, until the model's predictions are consistent with the real data — that is, until the error against the labels is small — and then we say the model's training is complete. After training, suppose you have a 72-ping house you want to sell and you don't know how much it should go for: you can input the feature, 72 pings, into the model and ask it to predict how much it can sell for.

At this point you might have doubts: surely a house's price shouldn't be determined by its area alone — we should also look at its location, its layout, and other factors to evaluate what a house is worth. That's right, so you can collect more complete data. Besides the area, we also collect the location and the number of rooms. Now the machine has to estimate the selling price from the area, the location, and the number of rooms, so now the area, location, and number of rooms are all called features, and the price is still the label, because we estimate the price from those three. The learning process is the same as before: we input the features — area, location, and number of rooms — into the model. The model is ignorant at the start, so it guesses randomly; suppose it guesses 4 million at first. Looking at the data in hand, the label is 5 million, so we ask the model to correct itself. Then we input the second data point and likewise ask the model to guess; it guesses 9 million, the label is 6.5 million, and we ask it to correct itself again. We just keep training like this until the model's error falls within an acceptable range, and then we say this model's training is complete. In the future, if you have a 72-ping house in Washington with 6 rooms that you want to sell and you want to guess how much it can go for, you can input those features into this model, and it estimates it can sell for 7.3 million. That, roughly, is the process of machine learning.
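The feature-to-label fitting just described can be sketched in a few lines. This is not the course's code: the toy numbers are made up, the categorical "location" feature is omitted for simplicity, and I use NumPy's least-squares solver in place of the iterative training the video describes — just to show the feature/label framing.

```python
import numpy as np

# Hypothetical training data: [area in pings, number of rooms] -> price (in units of 10k).
X = np.array([[50.0, 3.0], [66.0, 4.0], [80.0, 5.0]])
y = np.array([500.0, 650.0, 790.0])

# Append a bias column so the model is: price = w1*area + w2*rooms + b.
A = np.hstack([X, np.ones((len(X), 1))])
params, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the price of a 72-ping, 6-room house.
price = np.array([72.0, 6.0, 1.0]) @ params
print(round(price, 1))  # 820.0 on this toy data
```

The numbers are purely illustrative; the point is that each row of features maps to one label, and the fitted parameters let us predict a label for a new, unseen feature vector.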
Next, we'll go straight to implementation. In this class we're going to use Colab. Colab is a Python environment provided by Google for free; it lets us write Python programs in the browser very quickly and easily. But because it's provided by Google, you first need to create a Google account and log in. After logging in, you can see the nine dots here; click them and find Google Drive. In the upper left there's a "New" button; click it, and if you can find Google Colaboratory here, you can click it directly. If you can't find it, click "Connect more apps" — we can find Colab there. Click in, and click Install. It may ask you to log in; log in, and after the installation completes, click OK to finish. OK, now when we click "New" again, you can see Google Colaboratory has been added. Click in, and you can see it has created a Python notebook environment for us.

First, look at the upper left corner: that's the file name, and you can modify it — suppose I change it to "environment setup". And because I'm more used to a black background, I go to Tools and find Settings; under Site there's a Theme option. We select Dark and click Save, and the background turns black. Of course you can set whatever color you like; under Editor you can also set the text size, spacing, and so on — I'll leave that for you to play with yourselves.

Before we write any code, we need to connect. Go to the upper right corner and click Connect, and Colab will allocate some resources for us to use. Once the connection completes, we can see the cell in the middle — that's where we write our programs. First, we can type code directly into this cell. Suppose I type print(87) — I'll zoom in a bit so you can see it more clearly. After typing print(87), there's a run button here; click it, it executes, and the result is displayed below. In Colab, code is divided into cells. Hover the mouse slightly above this cell and you can see "Code" and "Text" buttons pop up; hover slightly below and they pop up too. Suppose I click to add a code cell: an extra cell appears, and we can write Python in it. Suppose I write print(88) and press run; you can see the result displayed below. Now suppose I click to add a text cell: a cell pops up where I can write text — suppose I write "hello everyone". Besides writing text, you can also choose some formatting, fonts, and so on. You can see there's now an extra text cell. We can add many code cells and many text cells; if you don't want a cell, there's a trash-can icon here — click it and the cell is deleted. Delete, delete, delete. OK.

Colab has another important point: it provides a free GPU and TPU for us to use. GPUs and TPUs can speed up computation a lot. Go to Edit at the top, then Notebook settings, and you can see a Hardware accelerator option where you can choose GPU or TPU. In later classes we'll also use the GPU for acceleration, so suppose I choose GPU here and press Save. You can see it re-allocates the connection, so we have to wait for it. After it reconnects, it asks whether we want to delete the previous runtime state; suppose I press cancel for now. After the GPU is connected, if we want to see which GPU it gave us, we can type !nvidia-smi in a code cell and run it. You can see the GPU we're using now is a Tesla T4. OK — with that, our environment setup is complete.
The first mathematical technique I want to introduce to you is called simple linear regression. You can see the word "simple" right at the front of its name, so you can imagine it's quite simple — don't worry too much. Let's first describe the situation we face today. Suppose you're the boss of a new start-up and you want to hire your first employee, but you're not sure how much salary to pay him. So you go to the market and collect some data on people in the same position: their years of seniority and the corresponding salary. If we take this data and draw it in a graph, with the x-axis as seniority and the y-axis as monthly salary, each cross here represents one piece of data you collected. From this picture we can easily see that seniority is roughly proportional to monthly salary: the higher the seniority, the higher the salary.

Now here's the problem. You're the boss of a new start-up and you want to hire your first employee. Your first candidate arrives and tells you his seniority is two and a half years — 2.5. So how much salary should you offer him? This is where simple linear regression comes in. Since seniority is roughly proportional to monthly salary, we can represent this data with a straight line. Simple linear regression is exactly that: representing the data with the best-fitting straight line. Suppose now that we've found that most suitable line; then we can plug the 2.5 years of seniority into the line to see how much salary we should give this employee. Plugging it in, we can see we should offer him about 51K.

Now the problem becomes: how do we find the most suitable straight line? Let's first look at how a straight line is represented mathematically. We can write y = w*x + b to represent a straight line. Applying this formula to our example, y is the monthly salary and x is the seniority. So the problem now becomes: we have to find the most suitable w and the most suitable b — the w and b that best represent this data.
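Once w and b are known, prediction is a single multiply-add. A tiny sketch — the parameter values here are made up for illustration only; the course finds the real w and b later via the cost function:

```python
# Hypothetical parameters; the best w and b are found later in the course.
w, b = 10.0, 26.0

seniority = 2.5               # years of experience
salary = w * seniority + b    # predicted monthly salary, in thousands
print(salary)                 # 51.0 -> roughly the 51K mentioned in the video
```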
Well, let's implement this directly and see what kind of straight line different values of w and b produce.
First, let's read in the data. We can use the pandas module to do the reading; I import it as pd. Pandas is a very handy data-processing tool, and we don't need to install it in Colab because it's preinstalled, so we can import it directly for use. Next, the URL of the data we want to use can be found in the course files; I store it in a variable. Pay special attention here: the file we want to read is a CSV file, so we can use pandas' read_csv to do the reading — we only need to pass the URL as the argument and it reads the file. I store the result in a variable too. We can display the result directly and run the cell. This is the data we'll use — seniority and the corresponding salary — indexed from 0 to 32, that is, 33 rows in total.

We want to represent this data with a straight line. We've said a straight line can be written mathematically as y = w*x + b. In our example we want to use seniority to predict salary, so seniority is x and salary is y. Let's first separate x from y. We want to get the YearsExperience column; to get that column, we can put brackets with "YearsExperience" after the DataFrame, and it pulls out that column. The same goes for y: I want the Salary column. Then I display x and y to check that everything looks right.
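The loading steps above can be sketched like this. I'm assuming the dataset's column names are `YearsExperience` and `Salary` (they may differ in the course file), and I inline a few sample rows instead of the course URL so the sketch runs standalone:

```python
import io
import pandas as pd

# In the course the data comes from a URL in the course files:
#   data = pd.read_csv(url)
# Here a small inline sample stands in for it.
csv_text = """YearsExperience,Salary
1.1,39.3
2.0,43.5
3.2,54.4
"""
data = pd.read_csv(io.StringIO(csv_text))

x = data["YearsExperience"]  # feature
y = data["Salary"]           # label
print(len(data), x.iloc[0], y.iloc[0])
```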
Now let's plot this data. I'll open another cell. For plotting we can use a very handy package called matplotlib; we use the pyplot module under it, which I import as plt. Like pandas, it's preinstalled in Colab, so we can import it directly. I can use scatter under it to draw the plot: we only need to pass in the x and y we just separated, and it draws the picture for us. Finally I want to display it, so I write show, and run. You can see it displays our data point by point. If you want to change the style of these points — suppose I want to change their color and turn them into crosses — I can add some parameters: I can write marker="x" to change the mark into a cross, and I can set color="red" to make it red. Run it, and you can see the marks turn into red crosses. As for what other markers and colors are available, if you're interested, I'll leave that for you to research on your own.
Here I can add a title to the plot; I can set its title — suppose my title is "seniority vs. salary", in Chinese. Let's run it, and you can see something's a little wrong in the title: it shows four empty boxes. The reason for this error is that matplotlib doesn't support Chinese by default. If we want to display Chinese, we have to add a Chinese font ourselves. Well, let's add one. First we need to download a Chinese font; I'll create a new cell for the download. We can use the wget tool to do the downloading, but it isn't installed in Colab by default, so we need to install it first. To install a package or module in Colab, you can type pip install followed by what you want to install, after the exclamation mark. After installing, import it, and then we can use its download function to do the downloading — just write the URL of the thing you want to download in the argument, and that's it. The URL can be found in the course files; this is the font we want to download, so we can run it directly: it installs first, then imports, then downloads. After the download completes, let's take a look: you can open the Files panel on the left, and this is the font we just downloaded, called ChineseFont.ttf. Let me close the panel first.

After downloading the font, we can add it to matplotlib. To add fonts, we import matplotlib, which I call mpl, and then from matplotlib's font_manager we import fontManager. With everything imported, we can use fontManager's addfont to add the font. The font we just downloaded is called ChineseFont.ttf, so that's what we write here. After adding the font, we also need to set it as the font to use now. To do that, we can use rc under mpl to configure settings. What we want to set is the font, so the first argument is "font", and the second argument specifies that the font family to use is ChineseFont — that should do it. So: add the font first, then set it as the font to use, and run again. You can see the Chinese now displays normally.

Besides setting the chart's title, we can also set the chart's x-axis and y-axis labels. To set the x-axis label, we can use xlabel; our x-axis is seniority, so I write "seniority". Then the y-axis, with ylabel: the y-axis is the monthly salary, and its unit is thousands, so I write "monthly salary (thousands)". Run again, and you can see the x-axis and y-axis labels now show seniority and monthly salary.
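Putting the plotting steps together, the cell might look like this sketch. I use English labels and inline sample values so it runs without the Chinese-font download; in the video the title and labels are Chinese and need the ChineseFont setup described above. The Agg backend line is only for running outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in Colab
import matplotlib.pyplot as plt

x = [1.1, 2.0, 3.2, 4.5]      # seniority (years) -- sample values
y = [39.3, 43.5, 54.4, 61.1]  # monthly salary (thousands) -- sample values

# Red crosses for the data points, plus a title and axis labels.
plt.scatter(x, y, marker="x", color="red")
plt.title("Seniority vs. salary")
plt.xlabel("seniority (years)")
plt.ylabel("monthly salary (thousands)")
plt.show()
```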
Next, let's draw the straight line. I'll add another cell here. As I mentioned just now, a straight line can be written mathematically as y = w*x + b — but pay special attention: we want to use the value of x to predict y, so this y is different from our y. One is the predicted value, and the other is the real data. So going back down here, we can write x times w plus b to represent our predicted y; I call it y_pred to stand for the prediction. Now, suppose I first set w to 0 and b to 0, and let's see what the line drawn with w=0, b=0 looks like. To draw a line we can use plot under plt: the first argument is the x values, and the second argument is the y values — this time our y values are the predicted result, y_pred. We also need to display it, so write show, OK, and run it. You can see this is the line drawn when w equals 0 and b equals 0. I can also change its color — suppose I set it to blue, which I think looks good. Then I'll draw the data above — those crosses — into the same plot: I copy all of that here and run again. This is our prediction line, and these are the real data. Our goal is to find the line that best represents this data; right now the line is w=0, b=0, so we need to find the most suitable w and b.

Before that, we can add a legend to the chart. plot draws the line, and scatter draws the crosses. We can add a label to each: the line's legend entry is "prediction", and on the crosses' side we also add a label, "real data". After adding the labels, we write plt.legend() here to display the legend. Run it, and you can see the legend displayed here: the line is the prediction line, and the crosses are the real data. Let's now wrap all of this into a function.
That way, we can plug in different w and b. I call the function plot_pred; it takes w and b as parameters, and I indent the body. Let's try plugging in different w and b and see what the result looks like. Here we can call plot_pred — suppose I pass in 0 and 0 first and run it once; you can see the result is the same as before. Suppose I modify the value of b and change it to 10, and run again. Looking at the line, it seems unchanged — but it actually has changed. The reason is that the ranges of the x-axis and y-axis aren't fixed, so the graph only looks like it hasn't changed. So let's fix the axis ranges first. To fix them, we can write it like this: plt.xlim sets the minimum and maximum range. The x values run from about 0 to 10 — a little more, actually — so I set it to 0 to 12; we can write a list here with the minimum and maximum values. The y values are the same: we set their minimum and maximum, and since we might get negative values later, I let the minimum be -60 and the maximum 140. Then we run again, and you can see the axis ranges are now fixed: the y-axis is -60 to 140 and the x-axis is 0 to 12.
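Collected into a function with fixed axes, the cell might look like this sketch (sample data inline; the Agg backend line is only for running outside Colab):

```python
import matplotlib
matplotlib.use("Agg")  # not needed in Colab
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1.1, 2.0, 3.2, 4.5, 6.0])       # seniority -- sample values
y = np.array([39.3, 43.5, 54.4, 61.1, 75.0])  # salary -- sample values

def plot_pred(w, b):
    """Draw the prediction line y = w*x + b against the real data."""
    y_pred = x * w + b
    plt.plot(x, y_pred, color="blue", label="prediction")
    plt.scatter(x, y, marker="x", color="red", label="real data")
    plt.xlim([0, 12])     # fix the axes so changes in w and b stay visible
    plt.ylim([-60, 140])
    plt.legend()
    plt.show()

plot_pred(0, 10)  # a flat line at y = 10
```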
Let's try 0 and 0 again: this line sits where y equals 0. If I change the value of b to 10 and run again, you can see the line moves up. So we learn that b controls whether the line moves up or down: change it to 40 and it goes further up; change it to -30, run it, and you can see it goes down. Now let's try changing the value of w. It was originally 0; let's change it to 10 and see — the line becomes slanted, sloping upward. Change it to 20 and it slopes more steeply. If I change it to a negative number, -10, you can see it tips over and slopes downward: positive w slopes up, negative w slopes down. Let me try -5 over here, and you can see a line like this; we can't see this part of it because it goes beyond the plot boundary.
Now let's make this plot dynamic, so we don't have to keep adjusting the values of w and b by hand. To add interactive components in Colab, we can use the ipywidgets tool. Here I import interact from it. Then I can use it like this: write the function name as the first argument, and the things we want to adjust dynamically — the values of w and b — are the two parameters that function takes. Here I can set the range for w: I want it between -100 and 100, with a step of 1. Then set the value of b: suppose I also want -100 to 100 with a step of 1. Run it directly, and you can see interactive sliders appear here. The default is w and b both 0, and I can make adjustments: when I push w up to 8, the line looks like this; then push b up and it slowly moves upward, and keep pushing. Everyone can play around, adjusting different values of w and b to see what the line looks like. Well, I'll leave that to you to play with.
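The interactive version can be sketched like this. It needs ipywidgets (preinstalled in Colab); I guard the interact call so the sketch also runs as plain Python, where the widgets can't be displayed. plot_pred here is the same kind of function as above, repeated so the block is self-contained:

```python
import matplotlib
matplotlib.use("Agg")  # not needed in Colab
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1.1, 2.0, 3.2, 4.5, 6.0])       # sample data
y = np.array([39.3, 43.5, 54.4, 61.1, 75.0])

def plot_pred(w, b):
    plt.plot(x, x * w + b, color="blue")
    plt.scatter(x, y, marker="x", color="red")
    plt.xlim([0, 12])
    plt.ylim([-60, 140])
    plt.show()

try:
    from ipywidgets import interact
    # Sliders for w and b, each as (min, max, step).
    interact(plot_pred, w=(-100, 100, 1), b=(-100, 100, 1))
except Exception:
    # Outside a notebook there are no widgets; fall back to one static plot.
    plot_pred(8, 0)
```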
Having seen what kind of line different w and b produce, our question becomes: which line best fits this data? Let's define that by giving each line a score. Look at a relatively simple data set first. Suppose I also want to represent this data with a straight line, and today I'm using this one — the line that results from w equal to 0 and b equal to 0. We should give this line a score for how well it fits the data, or you could say, how well this line matches the data. So how do we score a line? It's very simple: we just need to compute the distances between the real data points and the line. If the data matches the line well, the distances between the data and the line will be small. So we only need to compute, for each candidate line, its distances to the real data, and then find the smallest total among them — that line is the one that best fits the data.

Let's do the actual computation directly. In this example there are three data points, at positions (1, 1), (2, 2), and (3, 3). We want to represent them with this line — the line where w equals 0 and b equals 0 — and give it a score for how well it fits the data. How do we score it? We can use the sum of the distances between these data points and the line as the basis for the score. This point contributes 1 minus 0, a distance of 1; this point 2 minus 0, a distance of 2; and here, 3 minus 0 is 3. So our formula is written like this — and you may notice that besides 1 minus 0, I also squared it. The reason for squaring is that it's convenient for the calculation: you may get negative numbers later, and squaring solves the negative-number problem directly. So in this example, the square of 1 minus 0, plus the square of 2 minus 0, plus the square of 3 minus 0 — the sum of the distances, or more precisely the sum of the squared distances — is 14.
Let's look at another line
. Suppose we want Use this line to represent these data .
This line is the result of w equal to 2 and b equal to 0.
We also calculate their distance
. The point here is also 1, so here is 1 minus 2, 1 minus 2
Then here is the square of 2 minus 4, 2 minus 4
, 3 minus the square of 6.
You can
see that there will be a negative number here
. If we don’t have a square, there will be a negative number here
, so after the same calculation here
Do the sum
, the same
is 14. Then look at the next line.
You can see that this line is quite consistent
. It is the
result of w equal to 1 and b equal to 0. Then calculate the distance
, and you can find that each of them has the same value
, so the calculated distance will be
If the three lines are 0 , we will say that the line where w is equal to 1 and b is equal to 0
is the most suitable straight line for these data
, and then the scores
or distances generated according to different w or b
The sum of squares , we can write it as a function
. This function is called cost function
in Chinese, it is a cost function.
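The scoring just described can be sketched in a few lines (the three points and the three candidate lines are the ones from the example):

```python
import numpy as np

# The three example data points: (1, 1), (2, 2), (3, 3)
x = np.array([1, 2, 3])
y = np.array([1, 2, 3])

def score(w, b):
    """Sum of squared distances between the data and the line y = w*x + b."""
    y_pred = w * x + b
    return ((y - y_pred) ** 2).sum()

print(score(0, 0))  # line y = 0      -> 14
print(score(2, 0))  # line y = 2x     -> 14
print(score(1, 0))  # line y = x fits perfectly -> 0
```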
In our example, the cost function can be written like this: take the real data, subtract the predicted value, and square it. In other words, the squared distance from the real data to the line is the cost, which is the score we just talked about. Let's go back to the original example. We want to find the line that best fits the data, so we have to give each line a score, that is, calculate its cost here. Let's assume for now that every line has b equal to 0, and see what cost corresponds to different values of w. We plug them in one by one: first a w equal to about -10, whose cost lands here; then the second and the third; and we keep plugging in point after point. After plugging in many, many points, we find that the cost function is actually a parabola like this. That was the case where b equals 0. If b is not equal to 0, what does the cost corresponding to w and b look like? Here one axis is the value of w, another is the value of b, and the z-axis is the value of the cost. You can see its graph looks like this, and the red point is the point with the lowest cost. Let's look from two other angles. From these angles you can see that the place with the red dot, as I just said, is the place with the lowest cost: roughly where w equals 10 and b equals 20, the cost is the lowest. This is our goal. Our goal is to find the most suitable line, and finding the most suitable line means finding the most suitable w and b, the ones for which the sum of squared distances between the resulting line and all the real data is the smallest, that is, where the cost is the smallest.
Now that we understand this, let's implement the cost function directly. Here I opened a new Colab file to implement the cost function. At the beginning we need to read the data first; reading the data works the same as before, so I just copy, paste, and run it. After the data is read, we implement the cost function in a new cell. Our cost function takes the real data and subtracts the predicted value, then squares it, so we can write it like this: first compute the predicted value, which I call y_pred; it equals w times x plus b. We don't know anything about w and b yet, so let's assume w equals 10 and b equals 0. After computing the predicted value, we take the real data y, subtract the prediction y_pred, and square it, so I wrap it in parentheses and square it, and call this cost. Displaying the cost directly, you can see there are 33 values in total: these 33 values are the real data minus the predicted values, squared, that is, our squared distances. If we want to sum up these squared distances, just write .sum() after it and display it; you can see this is the summed result, the sum of squared distances. When the value here gets large we usually take the average. We have 33 records in total, so we could divide by 33 here, but it is better to write it as dividing by the length of x, which is 33; dividing by that gives the average squared distance. Next, let's write this cost calculation as a function, so that we can plug in different values of w and b. In another cell, I call it compute_cost; it needs to take our data x and y, and the predicted line's w and b values. The calculation is the same as above, so I paste it directly: first compute the predicted value, w times x plus b, then calculate the cost, sum it up, and take the average. I set the final cost equal to the summed and averaged result, and return that cost. Let's try calling it with different w and b to see what results come out. Here x and y are the real data, and for w and b we just used 10 and 0, so let me run it with 10 and 0. It says compute_cost is not defined; oh, that cell hasn't been executed yet, so it needs to be run first, and then run this one. The value returned is about 602, the same as before. Changing the input to 10 and running again, you can see the cost this time is about 227. You can plug values in yourself to see what cost different w and b will produce.
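The compute_cost function described above might look like this (the video loads 33 records from a CSV; here I substitute a small made-up dataset so the sketch is self-contained):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Average squared distance between the data and the line y = w*x + b."""
    y_pred = w * x + b
    cost = ((y - y_pred) ** 2).sum() / len(x)
    return cost

# Stand-in data (the video uses 33 records loaded from a CSV)
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([5, 7, 9, 11, 13], dtype=float)  # exactly y = 2x + 5

print(compute_cost(x, y, 2, 5))   # perfect fit -> 0.0
print(compute_cost(x, y, 10, 0))  # a bad line gives a large cost
```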
Now let's try fixing b equal to 0 and letting w range between -100 and 100, and see what its cost will be. Here I first define costs and let it be a list, to store the cost values for w between -100 and 100. We can use a for loop to do the calculation: I let w range from -100 to 100; here I have to write 101, because range generates a sequence from -100 up to, but not including, 101, that is, -100 through 100. Then we use the cost-calculation function directly, with x and y, b fixed at 0, and each value of w brought in one after another; I call the result cost and append it to our costs list. Finally I display it and run. The total number of w values is... wow, 201 values in total, from w equal to -100 to 100. I'll turn off the cost display first, and then draw the cost corresponding to different w as a picture. Here we import our plotting tool matplotlib and the pyplot under it, which I call plt. We can use scatter to draw each data point: we plot each w against its corresponding cost, with our w between -100 and 100. Displaying and running it, you can see that between -100 and 100 there are 201 points in total, and drawn like this they are densely packed. Besides drawing it this way, we can also connect them directly into a line, a parabola: here we use plot, passing in w and the corresponding costs, and running it connects the parabola. Now I'll add a title and labels: the title is the cost function when b equals 0 and w is between -100 and 100; then set the xlabel to the value of w and the ylabel to cost. Running it, you can see the title and the labels of the x-axis and y-axis are all there.
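The loop and plot just described might look like this (again with stand-in data in place of the video's CSV; the Agg backend is only there so the sketch runs without a display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def compute_cost(x, y, w, b):
    y_pred = w * x + b
    return ((y - y_pred) ** 2).sum() / len(x)

# Stand-in data
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([5, 7, 9, 11, 13], dtype=float)

costs = []
for w in range(-100, 101):        # 201 values: -100 .. 100
    costs.append(compute_cost(x, y, w, b=0))

plt.plot(range(-100, 101), costs)  # connect the 201 points into a parabola
plt.title("cost function, b=0, w: -100~100")
plt.xlabel("w")
plt.ylabel("cost")
plt.savefig("cost_w.png")
print(len(costs))
```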
Next we also take b into account, letting the value of b range from -100 to 100 as well, and see what the cost will be. Here I import a very useful tool for matrix operations, numpy, which I call np. Now both w and b are between -100 and 100. We can use np.arange; the usage of arange is basically the same as range above. I want to create an array from -100 to 100 with an interval of 1, and we can write it like this. I call it ws, because there are many w values, and likewise for b I call it bs. Then I create a two-dimensional matrix. Here I can use np.zeros to create a matrix filled with 0s; the first dimension is 201 and the second is also 201, because there are 201 values of w and 201 values of b. This matrix is to store the costs corresponding to different w and different b, so I call it costs. Then we can do the calculation: I use a for loop to run through all the values in ws, and inside it another for loop through all the values in bs, and then we can calculate their cost. We already wrote the cost function above, so I just copy it down here and use it directly: x and y are the real data, and passing in w and b computes the cost, which we can store in the costs matrix. Here I define an i that starts at 0, and then a j that also starts at 0; the entry at i and j is set equal to this cost, then j is incremented by 1, and finally, out here, i is incremented by 1. After the calculation, the cost for every combination of all w and all b is stored in this costs matrix. Displaying it and running... it says numpy has no such attribute; I wrote an extra r here. Running it again, it needs a little time. It looks like this: a two-dimensional matrix, and each value in it represents the cost computed from one corresponding w and b.
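The nested loop filling the 201-by-201 cost matrix might look like this (stand-in data again):

```python
import numpy as np

def compute_cost(x, y, w, b):
    y_pred = w * x + b
    return ((y - y_pred) ** 2).sum() / len(x)

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([5, 7, 9, 11, 13], dtype=float)  # exactly y = 2x + 5

ws = np.arange(-100, 101)        # 201 values of w
bs = np.arange(-100, 101)        # 201 values of b
costs = np.zeros((201, 201))     # costs[i, j] = cost for ws[i], bs[j]

i = 0
for w in ws:
    j = 0
    for b in bs:
        costs[i, j] = compute_cost(x, y, w, b)
        j += 1
    i += 1

print(costs.shape)
print(costs[102, 105])  # ws[102] = 2, bs[105] = 5: the perfect fit
```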
Next, let's draw the cost computed by considering w and b at the same time. Because we consider the values of w and b together, we will draw a 3D graph. To create a 3D graph, we can write plt.axes and set its projection to 3d; I call it ax. Displaying it first and running, you can see a 3D image is created, but there is nothing in it yet. We can see the panes of this image are a bit gray. I don't want them gray, so I can make a setting: I can use ax.xaxis.set_pane_color to set the color. I want it white, so I write an RGB value for white here and run. It says there is no such attribute; I have it written backwards here. After fixing it, you can see the x-axis pane becomes white. Then I do the same on the y-axis and z-axis sides, copying it directly and just changing x to y and z. Running again, they are all white; that feels much more comfortable. Next, let's draw the cost corresponding to w and b as a surface graph. Here we can use plot_surface, passing in the w and b values we just created, that is, the two arrays ws and bs, and last the costs we stored. Pay special attention here: in fact we don't just pass in the two one-dimensional arrays ws and bs; what we want is the two-dimensional grid generated from those two one-dimensional arrays. To generate this two-dimensional grid, we can use meshgrid under numpy, and pay special attention to the order: we first pass in bs and then ws, and it makes them into a two-dimensional grid and sends it back to us. I call the results b_grid and w_grid. If you want to know more about what this two-dimensional grid is and how it works, you can take a look at this URL for a very detailed explanation; I won't go into it here, but if you are interested I'll put the URL below and you can study it yourself. Here we change the call to pass in w_grid and b_grid, and then there is no problem. Running it, you can see the surface picture displayed.
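The 3D surface, together with the decorations described next (white panes, colormap, transparency, wireframe, rotation), might be sketched like this; the cost matrix here is a stand-in paraboloid with its minimum placed at w = 10, b = 20, matching the region observed in the video:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

ws = np.arange(-100, 101)
bs = np.arange(-100, 101)

# plot_surface wants 2-D grids, not the 1-D arrays themselves;
# note the order: bs first, then ws, so rows follow ws and columns follow bs
b_grid, w_grid = np.meshgrid(bs, ws)

# stand-in paraboloid in place of the cost matrix computed from the data
costs = (w_grid - 10.0) ** 2 + (b_grid - 20.0) ** 2

ax = plt.axes(projection="3d")
for axis in (ax.xaxis, ax.yaxis, ax.zaxis):
    axis.set_pane_color((1.0, 1.0, 1.0, 1.0))  # white panes instead of gray

ax.plot_surface(w_grid, b_grid, costs, cmap="Spectral_r", alpha=0.7)
ax.plot_wireframe(w_grid, b_grid, costs, color="black", alpha=0.1)
ax.set_title("cost for different w, b")
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("cost")
ax.view_init(45, -120)  # up-down angle, then left-right angle
plt.savefig("cost_surface.png")
```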
Let's add a title and labels to this picture. Here we can write ax.set_title; pay special attention that we need to add set_ in front, which is different from before, where we just wrote title directly. I set the title to the cost generated by w and b. If we want to write it in Chinese, we need to add a Chinese font, so I go back to the previous file, to the place where Chinese was added: it first has to be downloaded and then registered, so I paste both steps here, downloading first and then adding the font. With the title set, we set the labels of the x, y, and z axes; here too we need set_xlabel. The label of the x axis is w; then I copy it to set the y-axis label to b and the z-axis label to cost. Displaying it again, you can see this is w, this is b, and their corresponding cost is on the z axis. If we want to rotate this picture, that is also possible: we can call view_init, which takes two parameters, the first the up-down rotation angle and the second the left-right rotation angle. Suppose I write 45 for the first one and -120 for the second; running it again, you can see the picture is now turned like this: this side is w, this side is b, and this is the cost. You can also play with the rotation angles yourself. Then, I think the color of this surface is not very good-looking and I want to make one more modification: we can set another parameter here called cmap, which I set to Spectral_r. Running it, you can see the color is much better now; as for what other colors can be set here, I'll leave that for you to research, but I think this color is good. Then we can set an opacity value, called alpha; let it equal 0.7 and run, and you can see it now has a more transparent feeling, which looks more comfortable. Then, if we want to make this picture look even better, I can add a wireframe to it: I can write plot_wireframe and pass in the same first three parameters, and then I set the wireframe color to black. Running it, you can see the wireframe is added now, but it is a bit too dark. We can also set its transparency alpha; suppose I make it 0.1. Running it, you can see it is very comfortable and beautiful.
Next, let's find the point with the lowest cost. To find it, we can use min under numpy over our entire costs matrix. I print it out to see what the lowest cost is: the lowest cost is about 32.69. If we want to know which w and b correspond to this cost of about 32.69, we can do it like this: we can use np.where to find its location, that is, the index of the lowest cost among all the costs. Because costs is a two-dimensional matrix, it returns two values, two indexes, which I call w_index and b_index. Printing these two values and running, you can see the indexes it finds are 109 and 129. We then look up the w value corresponding to this index in the whole ws array, and the same for b. Running it again, you can see the corresponding values: the corresponding w value is 9 and b is 29. That is to say, when w equals 9 and b equals 29, the cost is the smallest. Here we can write it like this: when w equals this value and b equals this value, there is the minimum cost, and the minimum cost itself can be obtained from the two-dimensional costs matrix at those two indexes. Running it again, it says: when w equals 9 and b equals 29, there is the minimum cost, and the minimum cost is about 32.69.
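The minimum lookup can be sketched like this; I use a stand-in paraboloid whose minimum is deliberately placed at w = 9, b = 29 with value 32.69, matching the numbers found in the video:

```python
import numpy as np

ws = np.arange(-100, 101)
bs = np.arange(-100, 101)
b_grid, w_grid = np.meshgrid(bs, ws)

# stand-in cost matrix whose minimum sits at w = 9, b = 29
costs = (w_grid - 9) ** 2 + (b_grid - 29) ** 2 + 32.69

min_cost = np.min(costs)
# np.where returns one index array per dimension of the 2-D matrix
w_index, b_index = np.where(costs == min_cost)
print(f"when w = {ws[w_index[0]]} and b = {bs[b_index[0]]}, "
      f"the minimum cost is {min_cost}")
```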
Finally, we can also draw the point of minimum cost. To draw it, we can use scatter, passing in the value of w first, then the value of b, and then the z-axis, which is our cost value. Running it, you can see the point of smallest cost is here. I can change its color by setting color equal to red, and I want to make it bigger, so I set its s, which is the size, and let it be 40. Running again, you can see it looks much more comfortable. Finally, I want to make this whole picture bigger. I can go to the top and write plt.figure, setting its figsize, which takes two values representing the width and height of the figure. Suppose I start with 5 and 5; let's see what it looks like. It's still a little small, so I'll make it a little bigger, 7 and 7, which looks more comfortable. The other parameters here, the rotation angle, the size of the picture, the colors, and so on, you can all adjust yourselves. This is the implementation of our cost function.
Having understood the cost function, our question becomes how to find the best w and b efficiently, where the best w and b correspond to the point with the lowest cost. It is not hard to see from our last implementation that we used a brute-force method: I exhaustively enumerated all values of w from -100 to 100 and looked at their costs to find the lowest point, and considering w and b together is the same: I exhaustively enumerated all combinations of w and b from -100 to 100 and found the lowest corresponding cost. But this is not a good way; we want to find the best w and b efficiently. To do that, we can use a method called gradient descent. Don't think of it as too hard: gradient descent is just changing the parameters according to the slope. In our example the parameters are w and b, so the values of w and b are changed according to the slope. Let's look directly at how gradient descent works.
Let's take this as an example. Assume b equals 0 and we only consider w: how do we find an optimal w that minimizes the cost? First we need to set an initial value of w; the initial value can be set randomly, so let's say I set it here, roughly equal to -75. Then we can compute the slope of the tangent line at this point through differentiation. Pay special attention here: when we use gradient descent, we don't actually know the blue line, this blue parabola. The parabola was obtained through our exhaustive brute force, so you don't know that the lowest point is here. Let's re-describe the problem like this: today you are blindfolded and thrown into some place where you can only go forward or backward, and your goal is to get to the lowest point. Fortunately, you have some way to measure the steepness just in front of and behind your current position, and you can use this steepness to find the way down. Back in the original example, at this point you can compute the slope of the tangent line through differentiation; the tangent slope is equivalent to the degree of steepness, and we can use this slope to find the way down.
Let me briefly explain how the slope is calculated. First we need to write down what the cost function looks like. In our example, the cost function takes the real data, subtracts the predicted value, and squares it; in mathematical terms, y minus y_pred, squared. This y_pred can be expressed as w times x plus b, because we are representing the data as a straight line, so it becomes y minus the quantity w times x plus b, squared; and in this example b equals 0, so we can omit it directly. Then we just differentiate with respect to w to get the slope of the tangent line. I won't show you the detailed differentiation process; if you are interested you can work it out yourself, and in fact it doesn't matter if you don't know what differentiation is at all, because there are many tools that can do the calculation for us automatically. After differentiating, if we want to know what the tangent slope is when w equals -75, we plug in -75; the x and y here are our real data, so plugging them in as well, we can calculate how large the tangent slope is. Once we know the slope, it is equivalent to knowing the steepness here.
After knowing the steepness here, we can go down. How do we go down? We take w and subtract the slope multiplied by a learning rate. What this learning rate is, I will explain later; for now just look at w minus the slope. In this example, w is currently about -75 and the tangent slope is obviously a negative number, so we are subtracting a negative number from w; that is to say, w increases, and after increasing, it moves forward. Then we keep repeating this action: calculate the tangent slope at the new point by plugging in the numbers, and once we have the slope, put it into the same formula. Now, where w is about -60, you can obviously see the slope is still negative, so w again subtracts a negative number, that is, adds a value, and moves forward. We keep repeating this action: calculate the slope, update w; calculate the slope, update w; over and over, until we are almost at, or right at, the lowest point, where the slope of the tangent line is very close to 0. Once it is very close to 0, w subtracts a value close to 0, which is equivalent to w no longer updating, and then we have found the lowest point. That is roughly how gradient descent operates.
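The procedure just described, with b fixed at 0, can be sketched like this; the toy data and the learning rate are my own choices, and the slope formula is the standard derivative of the squared-error cost, which the video derives later:

```python
import numpy as np

# Toy data generated from y = 3x (so the best w, with b = 0, is 3)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = -75.0            # a random-ish starting point, as in the example
learning_rate = 0.01

for _ in range(1000):
    # slope of the cost with respect to w (b omitted since b = 0)
    gradient = (2 * x * (w * x - y)).mean()
    w = w - learning_rate * gradient  # step downhill along the slope

print(round(w, 4))  # converges to 3.0
```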
Next, let's take a look at what the learning rate is. Go back to the previous example: at the beginning we pick an initial w, then compute its tangent slope, and to go down we subtract from the value of w the slope multiplied by a learning rate. The learning rate is a value you need to set; you have to decide how large it is. Look at the slope multiplied by the learning rate: if your learning rate is larger, this product will be larger, whether a larger positive value or a larger negative one; then w subtracts a larger value and the change in w is greater, that is to say, its steps are larger. Conversely, if your learning rate is small, the product here is relatively small, w subtracts a relatively small value, the change in w is relatively small, that is, the steps are relatively small. Now let's see what results different learning rates produce. First, if the learning rate is higher, the steps will be relatively large. Seeing this, you may have a question: are the steps gradually getting smaller? That's right, they gradually shrink; even if our learning rate stays the same, the steps gradually get smaller. Why? Because the slope here also changes: the slope here is relatively large, and the closer to the lowest point, the smaller the slope, so the steps get smaller too. Let's see what it looks like if the learning rate is smaller: then the steps will be smaller. Comparing the two pictures together, the one on the left has a high learning rate and the one on the right a small learning rate; you can see the strides on the left are relatively large while those on the right are relatively small. At this point you might say: surely we should just set the learning rate to a high value, because setting it larger makes the descent faster, so we reach the lowest point sooner. This is a very good question, but our learning rate can also be too large. What does that look like? Take a look: the first step takes you here, and then on the second step you don't reach the lowest point, you step right over it to the opposite side. Then look at the third step: it steps back over again, and you keep stepping over and over like this, never reaching the bottom, because your steps are too big and there is simply no way to land on the lowest point. That is the problem of too large a learning rate. And what happens if the learning rate is too small? Every step you take is very, very small, so small that you can't reach the lowest point even if you walk forever. So you can't make the learning rate too large or too small; we have to find a most moderate value. How do we find it? We can find it through continual experiments and tests. Well, that is the learning rate.
Then let's go back to gradient descent. The examples just now only considered w. If we now also have to consider b, how does gradient descent work? Basically the same. First you set random initial values of w and b. The picture you see here was also obtained by exhaustive enumeration, so you don't know that the lowest point is here. Applying the metaphor directly: today you are taken somewhere inexplicably and then blindfolded, and the terrain of this place is something like a canyon. Your goal is to find the lowest point of the canyon, but this time, unlike last time, you can walk not only back and forth but also left and right, and you can somehow know the steepness in the front-back direction and in the left-right direction, and then find the way down through that steepness. Steepness here, as before, means slope: here we compute the slope in the w direction, and here the slope in the b direction. To compute the slopes, we can again differentiate the cost function. In this example the cost function is the real data minus the predicted value, squared, that is, y minus y_pred squared, where y_pred decomposes into w times x plus b. Differentiating with respect to w gives the slope in the w direction, and differentiating with respect to b gives the slope in the b direction. After the differentiation, the results look like this: the slope in the w direction looks like this, and the slope in the b direction looks like this.
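Written out, the cost and the two slopes just described are (this is the standard least-squares gradient; the video states the same results without showing the derivation):

```latex
\[
\text{cost}(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (w x_i + b)\bigr)^2
\]
\[
\frac{\partial\,\text{cost}}{\partial w} = \frac{1}{n}\sum_{i=1}^{n} 2\,x_i \bigl(w x_i + b - y_i\bigr),
\qquad
\frac{\partial\,\text{cost}}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} 2\,\bigl(w x_i + b - y_i\bigr)
\]
```

The b-direction slope is the same expression with one fewer factor of x, exactly as the video notes.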
If we want to know the slope at a particular point, we just plug the values in: the current w here, the current b here, and x and y are our data. Then we can compute the slope in the w direction and the slope in the b direction, and with them update the values of w and b, which means we can go down: to update w, subtract from w the w-direction slope times the learning rate; to update b, subtract from b the b-direction slope times the learning rate. That takes us one step down. After reaching the new point, we recompute the slope in the w direction and the slope in the b direction, update again, and slowly walk down toward the lowest point. We are walking blindfolded, so how can we judge whether we are at the lowest point? Just as before, the closer you get to the lowest point, the smaller the slopes in the w and b directions become; when w and b only subtract tiny values, they essentially stop changing, and we can use this to judge where we are. As for the learning rate, you have to set a value yourself, and this value can be neither too big nor too small: if you set the learning rate too high, the steps are so large that you may never reach the lowest point; conversely, if the learning rate is too small, the steps are so tiny that you may never get there no matter how long you walk. That is how our gradient descent operates. Now let's try it out directly.
Okay, let's implement gradient descent. First of all, we have to read the data; the action of reading the data is the same as before, so I just use the copied code. As we just said, gradient descent computes the slope and then updates the parameters. To compute the slope in the w direction, we can differentiate the cost function with respect to w; to compute the slope in the b direction, we differentiate with respect to b. Let's do the calculation. Differentiating the cost function with respect to w, the result is 2 times x, multiplying w times x plus b minus y; that is the result of differentiating the cost function with respect to w. If we differentiate the cost function with respect to b, the only difference is that there is one less x multiplied in. I call the slope in the w direction w_gradient, and the one in the b direction b_gradient. Now, if I want to know what the slopes in the w and b directions are when w equals 10 and b also equals 10, I first set w to 10 and b to 10. Looking at the w direction first and running it, you can see it generates a total of 33 values, because we have 33 records in total: every record you bring in generates one slope. Here we will average them. To take the average, we can first sum them up and then divide by the number of records; I use n to represent how many records there are, so n is the length of x, and then we can divide by n. Running it again, the averaged result is about -118. Then let's look at the b direction: again we have 33 data, so it computes 33 values, and here we do the same, sum it up and divide by n. Running again, you can see the averaged result is -27.46. In fact, to compute the average here we can also just write .mean(), which computes the average directly, with no need to compute the length of x separately; you can see the result of running it is the same, and changing the w direction the same way, they are all the same. For convenience, I'll write this slope computation, or you can say this gradient computation, as a function. Here I call it compute_gradient; we need to pass in x and y, which are our data, and then the values of w and b to do the calculation, and we return the computed results for the w direction and the b direction. Then I open another cell to try it out: suppose what I want to know now is the slope when w equals 20 and b equals 10. That cell needs to be executed first, then run this one: the w direction is about 537, and the b direction is about 70.
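A compute_gradient along the lines just described might look like this (stand-in data in place of the video's 33 records):

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Average slopes of the cost in the w direction and the b direction."""
    w_gradient = (2 * x * (w * x + b - y)).mean()
    b_gradient = (2 * (w * x + b - y)).mean()
    return w_gradient, b_gradient

# Stand-in data (the video uses 33 records from a CSV)
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([5, 7, 9, 11, 13], dtype=float)  # exactly y = 2x + 5

# At the true minimum both slopes are 0 -> nothing left to update
print(compute_gradient(x, y, 2, 5))
```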
. Calculate w and After the slope in the b direction,
Once we can compute the slopes in the w and b directions, we can update w and b. The update rule is to take w and b and subtract their slopes multiplied by a learning rate. Let's assume the initial w is 0 and the initial b is also 0; you could set them randomly, but I'll use 0. We compute the slopes at w = 0, b = 0, and then update w and b accordingly. To compute the slopes we use the function we just wrote: we pass in w and b, and it sends back the slopes in the w and b directions. Then we update: for w, we subtract the w-direction slope multiplied by a learning rate; after some testing and experimenting, I think 0.001 is a good learning rate. To update b, likewise, we subtract the b-direction slope multiplied by this learning rate. I set w equal to the updated result, and b equal to the updated result as well. Let's display them and execute. You can see that after the update, w went from 0 to about 0.87, and b went from 0 to about 0.14.
Now that w and b have changed from (0, 0) to these new values, is the cost really decreasing? In other words, are we really going downhill? We already wrote the cost calculation before, so I'll copy it directly: compute_cost is copied over and we can use it here. Let's check whether the cost really dropped after w and b went from (0, 0) to the new values. I'll print it out and execute: it was originally about 6,040 and became about 5,286, so the cost really is decreasing; we really are going downhill.
Let's pause for a moment and look at how the gradient computation was written. Find compute_gradient: whether it's the gradient in the w direction or the b direction, you should have noticed there's a multiplication by 2. This factor of 2 can actually be omitted. Why? Look below: we take w and subtract the w-direction slope times a learning rate, and b minus the b-direction slope times a learning rate. Multiplying the slope by 2 inside the gradient is equivalent to multiplying by 2 afterwards. But that factor of 2 is unnecessary, because all it does is indirectly affect the step size, and the step size is already controlled by the learning rate. So instead of writing the 2 in the gradient, we could just multiply the learning rate by 2, and you can see the result of the calculation is the same. The multiplication by 2 above is therefore unnecessary, so let's just omit it. I delete the ×2 and execute again: this time the cost goes from 6,040 to 5,656. That's the result of a single update, since we've only updated w and b once.
Now let's see what it looks like after updating 10 times. I'll use a for loop to repeat this 10 times, and record the values of w, b, and cost along the way. I'll use an f-string: at the start I write which iteration we're on, i.e. the i-th update, then record the current cost, and then also record the values of w and b. Executing it, at the 0th iteration the cost is this value, then w and b follow, and we can check that the cost is really decreasing: 5,656 down to 3,161. I'll add some spacing and execute again, which is much more comfortable to read. But there are a lot of decimal places, which looks a bit untidy. If we want to display only two decimal places, we can write .2f at the back: that shows only 2 digits after the decimal point; if you want 3 digits it's .3f, and so on. I'll display only 2 digits for w and b as well and execute again, so it looks much neater. The cost is decreasing, and the values of w and b keep updating all the way.
Let's try giving it 20 iterations and execute: the cost keeps decreasing all the way, and by the 20th iteration only about 1,705 is left. But now the output looks untidy again, because from the 10th iteration on, the iteration number occupies two characters. If we want every number to occupy the same width, we can write a colon followed by the width: suppose I want it to occupy 5 characters, I write :5, execute, and no matter what the number is, it takes up 5 characters, which is much neater. Besides the cost and the values of w and b, I'll also record the slopes in the w and b directions. These also have lots of decimal places, so I'll display only 2 digits for them too, and execute again.
Now everything is recorded. Looking at the cost, we can see it's still clearly declining, so I'll let it run some more to see how far it can drop. After running 100 times, it still seems to be declining. But you can see the right side is becoming ragged again. The reason is that our numbers still differ in size, some two digits, some three, so the columns fall out of alignment. If we really want this tidy, we can use scientific notation: here I change .2f to .2e, and it will display two digits after the decimal point with the rest presented in scientific notation. What does that look like? Execute it and you can see: it writes 5.66e+03, which means 5.66 times 10 to the 3rd power, i.e. 5,660. If you see 5.66e-03, that's 5.66 times 10 to the power of -3. Represented in scientific notation, everything is much neater. And we can see the cost is still falling.
So let's run it some more. After the 100 runs, I'll let it run 1,000 times and execute. It seems it's still falling: 42-something, then 41-something. Let's push further: I'll run it 10,000 times, but printing all 10,000 lines is too much, so I'll have it print only once every 1,000 iterations. Here I check whether the iteration number is divisible by 1,000, and only then print the information. Execute again, and it prints once per 1,000 iterations.

Now this part looks a little messy again. The reason is the extra minus sign. To solve it, we can add a space at the front of the format spec: the space makes each field give up one extra character to represent the sign. Execute again, and the problem is solved.

Looking at the cost: after 10,000 iterations it still seems to be declining, from about 3.51e+01 down to 3.39e+01. Let's run another 20,000 and show more decimal places; I'll change it to .4e so it displays 4 digits. It keeps declining, but the decrease has become very small: over the 20,000 runs you can see the drop getting smaller and smaller. Look at the slopes next to it, too: the slopes in the w direction and the b direction are also very, very small, almost 0.
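The loop just described can be sketched end to end with toy data (the salary dataset itself isn't reproduced here; the data values, variable names, and the 0.01 learning rate below are stand-ins):

```python
import numpy as np

# toy data standing in for the video's seniority/salary records
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def compute_cost(x, y, w, b):
    return ((w * x + b - y) ** 2).mean()

def compute_gradient(x, y, w, b):
    # factor of 2 omitted, since the learning rate controls the step size anyway
    y_pred = w * x + b
    return ((y_pred - y) * x).mean(), (y_pred - y).mean()

w = b = 0.0
learning_rate = 1.0e-2
for i in range(10001):
    w_gradient, b_gradient = compute_gradient(x, y, w, b)
    w = w - learning_rate * w_gradient
    b = b - learning_rate * b_gradient
    if i % 1000 == 0:  # print only once every 1,000 iterations
        cost = compute_cost(x, y, w, b)
        # :5 pads the iteration number to 5 characters, .2e is scientific
        # notation, and the leading space reserves one character for a minus sign
        print(f"Iteration {i:5}: cost {cost: .2e}, w: {w: .2e}, b: {b: .2e}, "
              f"w slope: {w_gradient: .2e}, b slope: {b_gradient: .2e}")
```

On this toy data the line converges toward y = 2x + 1, with both slopes shrinking toward 0.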
Next I'll write this whole gradient descent process as a function, so it's convenient to use later. I'll call the function gradient_descent. Quite a few things need to be passed in: x and y, which are our data; the initial w and initial b; the learning rate; the cost function we use to judge how good the fit is; the gradient function used to compute the slopes; and finally the total number of iterations to run, run_iter, and how often to print out the information, which I'll call p_iter. I give p_iter a default value of 1,000, meaning it prints once every 1,000 iterations. So inside the function we change the print interval to p_iter, change the 20,000 to run_iter, and the initial values to w_init and b_init, so I set w equal to w_init and b equal to b_init. The cost calculation needs to be changed to use the cost function that was passed in, and compute_gradient changes to the gradient function.

While I'm at it, I'll also store the cost and the values of w and b along the way. For the cost I'll make a list called c_hist, for w a list called w_hist, and for b a list called b_hist. These three lists store the cost and the values of w and b from every one of the iterations we run. After each update I store them: w_hist.append to store w, the same for b, and the cost as well. Finally I can return: I return the final w and b, plus all the w, b, and cost values from the whole process. So let's run it.
It complains about some problems, so I'll create another cell. Before using the function we need to set the initial w and b: suppose my initial w and b are both 0, and the learning rate is 0.001. I can also write that in scientific notation: 1.0e-3, i.e. 1.0 times 10 to the power of -3, which is 0.001. For the cost function we pass in compute_cost, the one we wrote to calculate the cost, and for the gradient function, compute_gradient; here we can pass functions in as parameters. Then run_iter is 20,000, and p_iter I don't need to set since it has a default. OK, let's execute. It returns 5 values, so I'll name them too: the last w I call w_final, the last b I call b_final, and then the stored w, b, and cost histories.

Execute... why was it only printed once? Let's see why. Oh, I accidentally wrote the return inside the for loop, so it should go outside. Execute again and now it's fine: every 1,000 iterations we can see the cost keeps decreasing, and the slopes are decreasing too. The final w comes out at around 9.1, and b around 27.9. I'll also print out the final w and b values to look at: the final w and b are about 9.14 and 27.89. Same as before, I'll limit the decimal places and display two digits with .2f, so the final w and b show as 9.14 and 27.89.
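Putting the pieces together, here is a minimal self-contained sketch of the gradient_descent function described above (toy data; in the video the real x and y come from the salary CSV):

```python
import numpy as np

def compute_cost(x, y, w, b):
    return ((w * x + b - y) ** 2).mean()

def compute_gradient(x, y, w, b):
    y_pred = w * x + b
    return ((y_pred - y) * x).mean(), (y_pred - y).mean()

def gradient_descent(x, y, w_init, b_init, learning_rate,
                     cost_function, gradient_function, run_iter, p_iter=1000):
    w, b = w_init, b_init
    w_hist, b_hist, c_hist = [], [], []          # record the whole descent
    for i in range(run_iter):
        w_gradient, b_gradient = gradient_function(x, y, w, b)
        w = w - learning_rate * w_gradient
        b = b - learning_rate * b_gradient
        cost = cost_function(x, y, w, b)
        w_hist.append(w)
        b_hist.append(b)
        c_hist.append(cost)
        if i % p_iter == 0:                      # print once every p_iter iterations
            print(f"Iteration {i:5}: cost {cost: .4e}, w: {w: .2e}, b: {b: .2e}")
    return w, b, w_hist, b_hist, c_hist          # return OUTSIDE the for loop

# usage, mirroring the video's settings
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
w_final, b_final, w_hist, b_hist, c_hist = gradient_descent(
    x, y, w_init=0, b_init=0, learning_rate=1.0e-2,
    cost_function=compute_cost, gradient_function=compute_gradient,
    run_iter=20000)
```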
Then we can use these final values to make predictions. You shouldn't have forgotten the problem we want to solve, but let me review it for you. Suppose you're the boss of a start-up company. You want to hire your first employee, but you don't know how much to pay them, so you go out into the market and collect some information relevant to this position: seniority and the corresponding salary. You want to represent this data with a straight line, and once you have it, use the line to predict how much salary this employee should be given. Now that we've found the line, we can do the prediction. Suppose the applicant tells you they have three and a half years of work experience, meaning a seniority of 3.5. Then we can calculate how much salary to offer: take the w_final we found, multiply it by 3.5, and add the value of b_final, because we represented the data with a straight line, and a straight line can be written as w times x plus b; our x is now 3.5, and the unit is k. The result says that with a seniority of 3.5 we can give them about 59.88k. There are too many decimal places, so I'll show just one digit and execute again: our prediction is that we can give them about 59.9k.
Let me add some spacing here. Suppose another applicant comes along and tells you their seniority is 5.9 years; we just make another prediction. I'll also type the words "predicted salary" into the output and execute. Changing the input to 5.9, the predicted salary comes out at about 81.8k. With this, our problem is solved: we can predict salaries this way.
Next, I'll draw some of the data as graphs, starting with the cost. To draw, we can use the plotting tool matplotlib together with numpy. Let's plot the decline of the cost over the 20,000 updates. We can use plt.plot to draw it as a line: for the x values we use np.arange over the 20,000 iterations, i.e. 0 to 20,000, and the y values are our historical cost data, stored during the updates. Display it and execute: you can see what it looks like. Then I'll add some labels and a title for it. The title covers the 20,000 updates and their cost, so I'll write "Iteration vs cost", then add an x label, since the x-axis is the number of updates, iteration, and the y-axis is cost. Execute again, and the title and labels are all there.

From this picture we can see the cost drops very quickly at the start, but more slowly later. If we want to examine the early part in detail, suppose I only want to look at 0 to 100: I change the range to 0~100 here, and then slice with :100, i.e. the first 100 records, and execute again. This is the descent over its first 100 updates; if you want to look at other intervals, you can set them yourself. It seems the "iteration" label is missing a letter, so fix that and execute.
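The cost-curve plot can be sketched as follows (assuming matplotlib and numpy are installed; the c_hist here is a made-up stand-in for the history returned by gradient_descent, and the headless Agg backend replaces the notebook's inline display):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # headless backend; in a notebook the plot just displays
import matplotlib.pyplot as plt

# stand-in cost history; use the real c_hist from gradient_descent instead
c_hist = 1000.0 / (np.arange(20000) + 1)

plt.plot(np.arange(20000), c_hist)   # x: number of updates, y: cost
plt.title("Iteration vs cost")
plt.xlabel("iteration")
plt.ylabel("cost")
plt.savefig("cost_curve.png")

# zoom in on the first 100 updates, where the drop is steepest
plt.figure()
plt.plot(np.arange(100), c_hist[:100])
plt.title("Iteration vs cost (first 100 updates)")
plt.xlabel("iteration")
plt.ylabel("cost")
plt.savefig("cost_first100.png")
```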
Finally, we can also draw the update process of w and b over the 20,000 updates as a picture. First let's go back to the earlier cost-function section and copy the 3D plot: we copy the 3D plotting code over, and since back then we set w and b to be between -100 and 100, the calculation of the cost for w and b from -100 to 100 should be copied over too. I'll open another cell for the calculation and let it compute first. We're using Chinese fonts here, so we also have to download the fonts: in the same way, I go back to the earlier cost-function section, find the font-download code, copy it, and open a cell to run the download. OK, let's wait for it to finish the calculation, installation, and drawing.

OK, the drawing is finished. I'll adjust its angle: it was originally 0 degrees and 0 degrees, and I make it 20 degrees and -65 degrees, then execute again. The angle is good now; the red point is the lowest point. Next we draw the update process of w and b with a line. To draw a line we can use plot, passing in the w_hist and b_hist we saved, plus the cost history we saved, which is c_hist. Let me run it first: it draws the line, but I can't tell where the starting point is, so let's draw the starting point too. To draw a point I use scatter; here I reuse the copied code. The starting point is the 0th value of w_hist, the 0th value of b_hist, and the 0th value of the cost. I make its color green and execute again: the green point is our starting position, and from there it updates all the way over to this side, basically reaching the lowest point.

It updates the whole way, but I find the surface color a bit of an eyesore, so I changed the color, and I don't think the borders are necessary, so I removed them; execute again and it looks better. I also set the surface's opacity to 0.3 and execute again; that's much more comfortable to look at. So this is the curved surface: our green point is the starting point, it updates along the way, and it basically reaches the lowest point, since the red point is the lowest point.
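The 3D surface plus descent path can be sketched like this (toy data; assumes matplotlib and numpy are installed; the w_hist/b_hist below are made up, whereas the video uses the histories returned by gradient_descent):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                   # headless backend
import matplotlib.pyplot as plt

# toy data standing in for the salary records
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

def compute_cost(x, y, w, b):
    return ((w * x + b - y) ** 2).mean()

# cost surface for w and b between -100 and 100, as in the video
ws = np.linspace(-100, 100, 201)
bs = np.linspace(-100, 100, 201)
costs = np.zeros((len(ws), len(bs)))
for i, w in enumerate(ws):
    for j, b in enumerate(bs):
        costs[i, j] = compute_cost(x, y, w, b)
b_grid, w_grid = np.meshgrid(bs, ws)    # grids matching costs[i, j] = cost(ws[i], bs[j])

ax = plt.axes(projection="3d")
ax.view_init(20, -65)                   # the viewing angle set in the video
ax.plot_surface(w_grid, b_grid, costs, alpha=0.3)   # semi-transparent surface

# lowest point on the grid, drawn in red
i_min, j_min = np.unravel_index(costs.argmin(), costs.shape)
ax.scatter(ws[i_min], bs[j_min], costs.min(), color="red")

# descent path: a made-up straight path from (-100, -100) toward the minimum
w_hist = np.linspace(-100, ws[i_min], 50)
b_hist = np.linspace(-100, bs[j_min], 50)
c_hist = [compute_cost(x, y, wi, bi) for wi, bi in zip(w_hist, b_hist)]
ax.scatter(w_hist[0], b_hist[0], c_hist[0], color="green")  # starting point
ax.plot(w_hist, b_hist, c_hist)
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("cost")
plt.savefig("descent_3d.png")
```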
Now I can also play around here: when doing the gradient descent, I set different values. I'll copy a version down below. Suppose I set the initial point at -100, -100; let's try it and see what the result looks like. After running 20,000 times we again get the final w and b, plus the stored w, b, and cost for every step. We run the drawing here again to see what it looks like: this time the starting point is over here, and then it goes all the way down, down, down, then turns around like this, also getting close to the lowest point.

Then I can try not letting it update so many times: I'll let it update only 1,000 times and execute. This time it walks down, gets only this far, and goes no further, because we didn't update enough times. You can also adjust other things. For example, I can increase the learning rate a bit: I set it to 1.0 times 10 to the -2, which is 0.01, and try again here. It still reaches the lowest point. You may notice that the starting point we set isn't drawn at -100, -100; by rights the initial point should be here. Why? The reason is that what we stored is the result after the first update, so the first recorded point is where it landed after the first step: -100, -100 is the true starting place, and the first update already carried it over here, because the history stores values from the first update onward. Let me try again: if I increase the learning rate a bit more, say I change it to 5.9e-2, and execute, you can see that this time its strides are very large; it bounces back and forth like this, but it still seems to reach the lowest point.
So what if we make it even bigger? Here I change it to 1.0 times 10 to the -1 and execute. You can see something terrible happened: the surface in our picture is gone. Why? The reason is that the values went far outside the range: when we drew the surface, we set the values of w and b to be between -100 and 100, and now they've gone way beyond that, which makes the surface disappear. When our learning rate is set too high, it's possible that each update moves us not closer to the lowest point but farther away from it. In our example, every update moved farther from the lowest point, leading to the situation where the entire surface disappears. That's what the example just now showed: if you set the learning rate too high, it can get farther and farther from the lowest point, one step to here, a second step to there, and by the third step who knows where it's gone. Everyone can play with this themselves: set different initial values of w and b, set different learning rates, try different settings. This is our implementation of gradient descent.
After completing simple linear regression, let me briefly summarize the process of machine learning. In our example, the first step was to prepare the data. Based on the distribution of this data, we decided we could represent it with a straight line, so we used a straight line to represent the data. Then we needed to find the straight line that best represents the data, i.e. the line most suitable for it. What kind of straight line is most suitable for this data? We always have to give it a standard of judgment, so we set one up: the smaller the sum of the squared distances between the data points and the line, the more suitable we call that line for the data; conversely, the larger it is, the less suitable. Once we have a scoring method, we can't possibly enumerate all the straight lines, score every one, and pick out the best; that would be far too inefficient. We have to find the most suitable straight line in an efficient way, and here we used the gradient descent method.

This whole process is essentially the process of machine learning, and it's the same when you apply it to other examples. First of all, you need to prepare the data. Then you need to set up a model based on your data; in this example, the model we set was a straight line. After setting the model, it will contain some parameters you need to adjust; in this example, you need to adjust w and b. What counts as adjusting them well, and what counts as adjusting them badly? We always have to give it a standard of judgment, so we need to set a cost function; for this example we set it as the average of the squared errors. Then, since it's impossible for us to enumerate all parameter combinations, compute their costs, and pick out the best, which is too inefficient, we must find the best parameters in an efficient way. That efficient way is what we call an optimizer. In this example, the optimizer we used was gradient descent. That's roughly the simple machine learning process, and no matter what other example you apply it to, it's really the same: first prepare the data, then set up a model, then set a cost function, and finally set an optimizer.
The second model I want to introduce to you is called multiple linear regression. It's actually similar to the simple linear regression we introduced before, but it can have many features. Let's look at what problem we want to solve. In the earlier simple linear regression example, we used seniority to predict salary, but think about it carefully: predicting salary from seniority alone is a bit strange, right? We should consider more factors. So now we've collected more complete information: in addition to seniority, we also collected each person's education and where they work. Now we want to use seniority, education, and workplace to predict salary; in other words, our features have grown to include education and workplace. If we still want to use a linear model to represent this data, we can use multiple linear regression.

Written mathematically, multiple linear regression is: y equals W1 times X1 plus W2 times X2 plus W3 times X3, and so on, multiplying all the way depending on how many features you have, finally adding a b. In our example there are three features, seniority, education, and place of work, so written out the formula becomes: the monthly salary we want to predict equals W1 multiplied by seniority, our first feature, plus W2 multiplied by the second feature, education, plus W3 multiplied by the third feature, workplace, finally adding a b. Our goal is to find the combination of W1, W2, W3, and b that best expresses this data; that is what our multiple linear regression needs to do.
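As a quick illustration of the formula, the weighted sum can be written as a dot product (all numbers below are made up just to show the shape of the computation):

```python
import numpy as np

w = np.array([2.0, 3.0, 1.5])        # hypothetical w1, w2, w3
b = 10.0                             # hypothetical bias
features = np.array([3.5, 1.0, 0.0]) # one person's seniority, education, city values

# y = w1*x1 + w2*x2 + w3*x3 + b, written as a dot product
y = np.dot(w, features) + b
print(y)  # 2.0*3.5 + 3.0*1.0 + 1.5*0.0 + 10.0 = 20.0
```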
Before we start looking for the most suitable w's and b, I think everyone has already spotted a problem. In our formula we multiply W2 by education and W3 by the place of work, but as we can see, these two features, education and workplace, are both text. Text can't be multiplied, so we must first do some processing on these two features, converting them from text to numbers, before we can do the calculation.

Let's deal with the education feature first. It has three possible values: high school or below, university, and master's or above. From this feature we can see that its values actually have a high-low relationship. Since there is a high-low relationship, we can use the numbers 0, 1, and 2 to represent the three cases: 0 for the lowest, high school or below; 1 for university; and 2 for master's or above. Whenever a feature has a size relationship, a high-low relationship, we can use this method to replace the text. This replacement method has a name: it's called label encoding. After replacing the education values just now, the data becomes like this: looking back, we use 2 for master's and above, 1 for university, and 0 for high school and below, so here it becomes 1, 2, 0, and so on. Now let's implement this label encoding.
First, read the data in. The reading step is the same as before, so I'll use the copied code directly. However, the URL part changes, because this time our data has two more features: it's the salary data, second edition, so we write 2 at the end. Read it in and display it: OK, our data looks like this, with three features in total, namely seniority, education, and place of work.

We want to deal with the education feature first, converting it from text to numbers. We can do the conversion like this. From the data, let's take the education feature, which is called EducationLevel, and display it first to see what it looks like. We want university to become 1, master's and above to become 2, and high school and below to become 0. Here we can write .map() at the back and pass in a dictionary: we map high school and below to 0, then university to 1, and finally master's and above to 2. Written like this, it will do the conversion for us, converting every value inside this feature. After the conversion I assign the result back to this feature, and then we can display the entire dataset to see that it has now become the corresponding values: the first record here is a university degree, the second is master's and above, and the third is high school and below, so they become 1, 2, 0, and the rest below are converted the same way. With that, our label encoding is complete.
Having dealt with the education feature, let's deal with the place-of-work feature. This feature has three possible values: city A, city B, and city C. Here you might wonder whether we can use the same method, label encoding, using 0, 1, 2 to represent cities A, B, and C. But if you think about it carefully, is that really OK? We don't know of any high-low relationship, any size relationship, among cities A, B, and C. If we represented them with 0, 1, 2, who should be 0, who should be 1, and who should be 2?

If we want to convert a feature like this with no size relationship, we can directly convert it into multiple features, like this. Originally there is only one city feature; I change it into three features: city A, city B, and city C. The first record here originally worked in city A, so we give its city A attribute the value 1, and its city B and city C attributes the value 0. Look at the second record: it originally worked in city C, so we give its city C attribute the value 1 and the others 0, and so on. This method is called one hot encoding. Whenever we want to convert a feature with no size or high-low relationship from text to numbers, we can use one hot encoding, and it will change one feature into multiple features: however many possible values the feature originally had, that's how many features it becomes. In our example its possible values are cities A, B, and C, so it becomes three features: city A, city B, and city C.

After converting like this, we can actually delete one of the three features. Why? The reason is that among these three features, we only need to know the values of two of them to deduce the value of the third. Suppose we now delete the city C feature, and see whether there's a way to deduce city C from cities A and B. Looking at it now: if city A is 1 and city B is 0, then city C must be 0, because exactly one of these three features will be 1. Look at the second record: if city A is 0 and city B is 0, then city C must be 1, because if it's not city A and not city B, it must be city C, and so on. Therefore, we have a way to deduce the value of the third feature through two of the features, and at that point we can delete one of the features. Let's take a simpler example.
Suppose you now have a feature called gender, with only two possible values: either male or female. Obviously this feature has no high-low relationship and no size relationship, so we can use one hot encoding to turn it into two features, male and female. But every record is either male or female, so we can derive one feature from the other. In that case we don't need to split it into two features, because too many features make our calculations more complicated, so we can delete one of them.
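As an aside, pandas can do the one-hot-then-drop-one conversion in a single call with get_dummies (the video itself uses sklearn's OneHotEncoder later; this is just an alternative sketch with stand-in data):

```python
import pandas as pd

data = pd.DataFrame({"City": ["cityA", "cityC", "cityB", "cityA"]})

# one feature with 3 possible values becomes 3 columns; drop_first=True
# drops one redundant column, since it can be deduced from the other two
encoded = pd.get_dummies(data["City"], drop_first=True)
print(encoded)
```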
OK, so going back to the original example: among the three features, city A, city B, and city C, we only need to know the values of two to derive the third, so we can delete one of them; here we choose to delete the city C feature. Let me remind everyone, though: not every feature that can be derived should be deleted. Even if some features can be derived, they may have special meanings, or they may speed up our computation, and then we won't delete them. But in the one hot encoding example, we can delete one of the features after conversion. Well then, let's implement one hot encoding directly.
Now our data looks like this. We want to convert the city feature from text to numbers, and for this feature we can use one hot encoding. Here we can use the preprocessing module under sklearn, and the OneHotEncoder inside it. The sklearn suite provides a lot of things we will use when doing machine learning, such as the OneHotEncoder we are introducing now, which can do this conversion for us quickly. The OneHotEncoder is a class, so let's first create a converter, or you could say create an encoder; here I call it onehot_encoder. After creating it, we let this encoder read our city feature by calling fit and passing the feature in. One thing is important to note here: fit only accepts a two-dimensional matrix, not a one-dimensional one, so we can't write it with a single pair of square brackets. To make it two-dimensional we add another pair of square brackets; with two pairs of square brackets it is a two-dimensional matrix, and the encoder reads all the values of the feature. Then we can transform it: we use its transform method, again passing in the feature we want to convert, and again a two-dimensional matrix is required, so two pairs of square brackets are needed. I call the converted result city_encoded, and then we display it directly.
After execution you can see the converted result looks like this. Why does it look like this? The reason is that by default the OneHotEncoder sends back a sparse matrix. It doesn't matter if you don't know what a sparse matrix is; what we want to see here is the complete matrix, so we can append toarray, and then it returns the complete matrix to us. Execute again, and you can see this is the result we want: the 3 possible cities a, b and c inside one feature have been turned into 3 features with 3 values.
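The fit/transform/toarray steps above can be sketched as follows. This is a minimal sketch: the `City` column and its values are hypothetical stand-ins for the tutorial's salary data set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data frame; the real tutorial loads a salary CSV.
data = pd.DataFrame({"City": ["CityA", "CityB", "CityC", "CityA"]})

onehot_encoder = OneHotEncoder()
onehot_encoder.fit(data[["City"]])  # fit needs a 2-D input, hence double brackets
# transform returns a sparse matrix by default; toarray gives the complete matrix
city_encoded = onehot_encoder.transform(data[["City"]]).toarray()
print(city_encoded)
```

Each row has exactly one 1, marking which of the three cities that record belongs to.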
Then we can replace the original city feature with the converted result. Our original data is data; let's display it once. OK, the original data looks like this. We want to add 3 more features for it, that is, 3 more columns. To add 3 columns and 3 features at once we can write it like this, with two pairs of square brackets: I add a cityA, then a cityB, and finally a cityC, and we assign their values to be the result of the conversion we just did. Display it, and you can see that written this way it adds 3 more columns, the 3 features cityA, cityB and cityC. Then we can delete the city feature, because it has been converted into these 3 features, and among the three new features we can also delete one; I delete cityC. We can do it like this: write data.drop, and in the first parameter put the features we want to delete, represented as a list; we want to delete the two features city and cityC. The second parameter says along which axis the things we are deleting lie. city and cityC are two columns, and if what we want to delete is a column we need to specify the axis as 1; if what you wanted to delete were a row, you would specify the axis as 0. OK, so here we write 1, and then we display the result of the deletion. Execute, and you can see the two columns city and cityC are now gone. Then we process the workplace feature in the same way.
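The add-columns-then-drop sequence can be sketched like this; the frame and the encoded values are made-up stand-ins for the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the tutorial's salary data.
data = pd.DataFrame({"City": ["CityA", "CityB", "CityC"], "Salary": [41, 50, 62]})
city_encoded = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# Add the three new columns at once with double square brackets,
data[["CityA", "CityB", "CityC"]] = city_encoded
# then drop the original text column plus the redundant CityC (axis=1 = columns).
data = data.drop(["City", "CityC"], axis=1)
print(data)
```

CityC can be dropped because a row with CityA = 0 and CityB = 0 can only be city C.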
After converting the text, the next step: when training a model we usually divide the data into a training set and a test set. Training the model means finding the most suitable parameters; in our example, the most suitable w and b. When we train the model we usually don't use all the data, only part of it. What about the other part? The other part is for testing. Think about it: if we used all the data for training and found the best set of w and b, how would we verify how well that set of w and b works? We would want to test it, and we would want to test it on unfamiliar data. We would not test on the training data, because the machine has already seen the training data, which means it already knows the answers, and testing on data whose answers it already knows would not be accurate. So, in order to have unfamiliar data for testing, we usually split the data into a training set and a test set. Like this: suppose we have 10 records. Usually the training set accounts for about 70% to 80%; in this example we use 80%, so 8 records become the training set, and the remaining 20%, that is 2 records, become the test set. The training set is used to find the best w and b; the test set is used afterwards, once we have found the best w and b, to see how well they work. Now that we understand what training and test sets are, let's implement them directly.
OK, now our data looks like this. First, let's separate x from y. Everyone should remember that the model we want to use is multiple linear regression. Written as a mathematical formula, it is y = w1*x1 + w2*x2 + ... multiplied all the way down for however many features you have, with a b added at the end. In our current example there are 4 features, namely seniority, education, cityA and cityB, and the y we want to predict is the salary. So we first separate x and y: x is equal to the four features we need, the first is seniority, the second is education, then cityA and cityB, and y is equal to our salary. I display x and y: x is these 4 features, and y is the salary. Then we can split into a training set and a test set. Here we can again use sklearn, which is a very useful tool: we use the model_selection module under it and import train_test_split, and then it can do the split for us. In the first parameter we pass in our x, the features, and in the second parameter y. Then we can specify the size of the test set: here we write test_size, and since I want the test set to be 20% I set it equal to 0.2, and it will automatically take 2 tenths of our data as the test set and the other 8 tenths as the training set. If you want the test set to be 30%, you write 0.3, and so on. Written like this it returns 4 values to us: the x for training, the x for testing, the y for training and the y for testing, so I call the first one x_train, the second x_test, and then y_train and y_test. Then I display x_train, the x we use for training; you can see it took out these records as a training set. Let's look at its length: its length is 28. And the original length of x: the original length is 36. We took 20% as the test set, which means 80% as the training set; 36 multiplied by 0.8 is 28.8, and the 0.8 is automatically dropped, so 28 records become the training set. Looking at the length of the test set, you can see that in total x has 36 records, and after the split, 28 are used as the training set and 8 as the test set. The same applies to y.
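The split above can be sketched as follows. The frame here is a made-up 36-row stand-in for the tutorial's data; only the row count matches.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the tutorial's 36-row data set.
x = pd.DataFrame({"YearsExperience": np.arange(36), "EducationLevel": np.ones(36)})
y = pd.Series(np.arange(36) * 2.0)

# test_size=0.2 keeps 20% for testing: 36 * 0.2 rounds up to 8 test rows,
# leaving 28 rows for training.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(len(x_train), len(x_test))
```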
Then if we display x_train and take a look, you will notice one thing here: x_train seems to change every time we execute. Look at the first record: it is 5.1 0 1 0. Execute again and it changes to 6.9 2 1 0; execute again and it changes to 7.8 2 0 0. Why did it change? The reason is that the splitting process splits randomly by default, so the result of each split is different. If you want the split result to be fixed, we can also set another parameter called random_state, and give it a number; each number corresponds to a different split. Suppose the number I give it is 87. Execute again, and you can see this time the split looks like this: the first record is 4.6 1 1 0. Now, with random_state fixed at 87, I execute again and you can see it does not change. But if I change the number, it changes: if I change it to 86 and execute again, it changes again; if I keep executing without changing 86, it stays the same. So if you want to fix the split result, you give it a number; here I will specify 87 for it, because I want the split to be fixed, which will make my later demonstrations convenient. Okay, then I also display x_test: it is these 8 records; then y_train, OK, there are these; then y_test. Done, we have successfully split the test set from the training set.
Finally, for the convenience of subsequent calculations, I will first convert x_train and x_test into numpy format. Right now they are both in pandas format, so they look very pretty after execution, with a grid like this. If I convert them into numpy format they look like this: I can write to_numpy after them, and they become matrices. It is uglier as a matrix, but it makes my subsequent calculations more convenient, so here I convert it first. x_test is the same; I also convert it so the follow-up calculations are easier.
Having converted the text into numbers and divided the data into training and test sets, we can return to the model. The model we want to use is multiple linear regression, so it can be written like this, but our features have gone from 3 to 4: the workplace has become the two features cityA and cityB, so we have to rewrite it, and now it looks like this. Our goal now is to find a combination of w1, w2, w3, w4 and b such that the monthly salary predicted here gets as close as possible to the real data. Well, let's implement this part. First let me set up the values of w and b. At the beginning I set w randomly; there are 4 of them, w1 through w4, so I use a one-dimensional matrix with 4 values to represent it. So here I first import numpy and call it np. Next I create w and let it be a matrix with 4 values; suppose I let it be 1 2 3 4, meaning our w1 through w4. Then we need to set a value for b; in the same way I set it randomly and let it be 0.
After setting w and b, we want w1 to multiply the first feature, w2 the second, w3 the third and w4 the fourth. Because we are now in the training phase, the features we want to multiply are in x_train. I display x_train first; it looks like this, with a total of 4 columns, each column representing one feature. So what we want is this: 1 multiplied by the first column, 2 by the second column, 3 by the third and 4 by the fourth. If we want to multiply like this, we can write it directly: we multiply x_train by w, and it will automatically multiply 1 by the first column, 2 by the second column, and so on. Let's execute. This is the result of the multiplication: the first column here is w1 times x1, the second column w2 times x2, and so on. What we want next is for them to add up: w1*x1 plus w2*x2 plus the third and the fourth, that is, we want to add up each row here. To add each row we can do it like this: first put the calculation in brackets, then append sum. If I only write it that way, it sums up every value in the whole matrix, but what we want is to sum within each row. So I can set axis equal to 1; axis equal to 1 does the sum in the horizontal direction, and if you wanted to sum in the vertical direction you would set it to 0. OK, let's try executing. This is the result of the sum: each value here is the sum of one row from before, which is w1 times x1 plus w2 times x2 plus the third and the fourth. After all the additions are done, we still need to add a b at the end, and if I add b here it will add b to every value in it. But right now we set b to 0, so the effect doesn't show; I set it to 1 and execute again, and adding like this, every value here gets 1 added. So every value calculated here is our predicted salary, and I call it y_pred. Next, we want to find the most suitable combination of w and b.
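The broadcasting trick just described can be sketched with toy numbers (3 made-up rows; the real x_train has 28):

```python
import numpy as np

# Toy stand-in for x_train: 3 rows, 4 features per row.
x_train = np.array([[5.1, 0, 1, 0],
                    [6.9, 2, 1, 0],
                    [7.8, 2, 0, 0]])
w = np.array([1, 2, 3, 4])
b = 1

# Element-wise multiply broadcasts w across every row, then sum(axis=1)
# adds each row up: w1*x1 + w2*x2 + w3*x3 + w4*x4, and b is added to every value.
y_pred = (x_train * w).sum(axis=1) + b
print(y_pred)
```

For the first row this is 1*5.1 + 2*0 + 3*1 + 4*0 + 1 = 9.1, and likewise for the others.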
To find the most suitable combination of w and b, we must first define what "most suitable" means, that is, we need to set a standard for judging: we are going to set a cost function. The cost function here can be set the same as in the earlier simple linear regression, because again we want the predicted monthly salary to be as close as possible to the real data. So I set it as the real data minus the predicted value, then squared. The reason for squaring is that the subtraction here may produce negative numbers; for convenience of calculation we square it directly so there are no negatives. So what is our current goal? It is to make the real data minus the predicted value, squared, as small as possible; that is, the smaller the cost here, the better. Then let's implement the cost function directly.
Our cost function is set to be the real data minus the predicted value, squared. The predicted value was done in the previous step; the y_pred here is the predicted value, let me display it first, and this is our predicted value. As for the real data: because we are now in the training phase, the real data is y_train, so we use y_train minus y_pred. I also display y_train first to take a look; OK, there are these values in total. We use y_train minus y_pred, and after each subtraction we square it. Execute, and you can see each value here: it is the result of subtracting the predicted salary from the real salary and then squaring it. We hope the values here are as small as possible. To turn that into one number we can compute their sum or compute their average; here I will compute the average, and I hope the average is as small as possible. To compute it, enclose the expression in brackets and write mean at the end. Execute, and you can see this is the average result.
Here I will write the cost calculation directly as a function, which makes it convenient for us to plug in different w and b. I call it compute_cost. It takes x and y, which are our real data, and then the values of w and b. Inside it, y_pred is calculated first, our predicted value, but where it was x_train I change it to x to compute the prediction, and then we can compute the cost. The cost calculation is the same as here, except where it was y_train I change it to y: the real data minus the predicted value, squared, then averaged. That is the cost. Finally, we return the cost and try it directly. Here we call compute_cost, passing in x_train and y_train because we are in the training phase now, and for w and b I first pass in these two values. Let's execute and see: the calculation result is the same as above. If I set b here to 0 and set this to 0 2 2 4 and execute again, the cost calculated by that combination of w and b is even higher than the one just now; the one just now was 1,772.
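The compute_cost function described here can be sketched like this (the 2-row data and the costs it produces are toy numbers, not the tutorial's):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Mean of (real data - prediction) squared, the tutorial's cost."""
    y_pred = (x * w).sum(axis=1) + b
    cost = ((y - y_pred) ** 2).mean()
    return cost

# Toy data: 2 rows with 4 features each (the real x_train has 28 rows).
x = np.array([[5.1, 0, 1, 0], [7.8, 2, 0, 0]])
y = np.array([40.0, 60.0])
print(compute_cost(x, y, w=np.array([1, 2, 3, 4]), b=0))
```

A worse parameter guess gives a larger cost, which is exactly how we will compare candidate w and b.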
How about this? We have defined the evaluation standard, the cost function. After setting the cost function, we need an efficient way to find a set of w and b that makes the cost as low as possible. That efficient way is to set an optimizer. In our example we can again use the gradient descent method. You should remember that it changes the parameters according to the slope. However, when we did simple linear regression before, there were only two parameters, w and b; now our parameters have become 5: w1, w2, w3, w4 and b. Let's look at how we updated the parameters before: to update w, we subtract from w the slope in the w direction times the learning rate; to update b, we subtract from b the slope in the b direction times the learning rate. The method of updating the parameters is actually the same; we have just gone from two parameters to five. Now we have five parameters, w1 through w4 and b, and to update these five parameters we subtract from each one the slope in its own direction times the learning rate.
How do we calculate the slope in each direction? It is actually the same as before: we only need to differentiate the cost function to get the slope in each direction. Our cost function is the real data minus the predicted value, squared; written as a mathematical formula, it is y minus y_pred, squared, and we can expand y_pred into w1*x1 + w2*x2 ... down to the fourth term, with b added at the end. If we want the slope in the w1 direction, we differentiate with respect to w1, or to be precise, take the partial derivative with respect to w1. If what you want is the slope in the w2 direction, you take the partial derivative with respect to w2, and so on; w3, w4 and b are all the same. It's okay if you don't know what differentiation is at all, because there are many tools that can do the differentiation for us automatically. So the slope in the w1 direction, after calculation, looks like this, and you can see a very long string. But in fact we can notice that the w1*x1 + w2*x2 plus the third and fourth terms plus b inside this string is actually just y_pred, so I simplify it and it becomes this: 2 times x1 multiplied by the result of y_pred minus y. Using the same method to calculate the slope in the w2 direction, it comes out like this, and we can notice it just replaces x1 with x2, with everything else the same. From this we can see that the slope in the w3 direction replaces it with x3, and the slope in the w4 direction replaces it with x4. The slope in the b direction calculates out shorter: at the end there is no x to multiply. Here we can also see that whether in the w directions or the b direction, there is a multiplication by 2 at the front. I don't know if you still remember: we have actually said that this multiplication by 2 can be omitted, because when we update the parameters we will multiply by a learning rate anyway, so the factor of 2 can be left to the learning rate. We can omit it everywhere, and it becomes this.
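The simplified slopes the video arrives at can be written compactly as math (a sketch: m is the number of training records, the superscript (j) indexes records, and the factor of 2 is dropped because it folds into the learning rate):

```latex
\begin{aligned}
y_{\text{pred}} &= w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b \\
\text{cost} &= \frac{1}{m}\sum_{j=1}^{m}\left(y^{(j)} - y_{\text{pred}}^{(j)}\right)^{2} \\
w_i\ \text{gradient} &= \frac{1}{m}\sum_{j=1}^{m}\left(y_{\text{pred}}^{(j)} - y^{(j)}\right) x_i^{(j)} \\
b\ \text{gradient} &= \frac{1}{m}\sum_{j=1}^{m}\left(y_{\text{pred}}^{(j)} - y^{(j)}\right)
\end{aligned}
```

Each w's slope is the same expression with its own feature column plugged in, which is why the code can loop over the features.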
So now we know how to calculate the slopes. Returning to where we were: we now have the slopes in all directions, and they will be multiplied by a learning rate later. How do we set the learning rate? Just like before, through testing and experimentation. You can't set it too large, because it may step straight across the lowest point, or get farther and farther away from it; and you can't set it too small, because then it may never reach the lowest point. After setting the learning rate and calculating the slopes, we only need to keep updating the parameters, letting them approach the lowest point step by step, and then we can find the most suitable combination of w and b that we want.
Then let's implement gradient descent directly. First we calculate the slope in each direction. Because the slope in the b direction is relatively simple, I calculate it first. I call it b_gradient; it is y_pred minus y, and since this is the training phase our y is y_train. The y_pred has been calculated before, so I just copy it, but since we are in the training stage I change the x in it to x_train. I display the calculated result: because our training set has 28 records, there are 28 values here, so let's take an average: put it in brackets and then append mean to compute the average. The average is -46.94. After calculating the slope in the b direction, let's calculate the slope in the w1 direction. I call it w1_gradient; it is y_pred minus y_train, multiplied by x1. This x1 is actually our first feature, and since we are in the training phase now, our features are in x_train. I display x_train first: you can see it shows four columns in total, and these four columns are our four features. Here x1 is the first feature, which is the first column. If we want to take out the first column, we can write it like this: a pair of square brackets containing :, 0. The : means we want all the values in the first dimension, and the 0 after the comma means that in the second dimension we only need the first value. Look at this matrix: it is a two-dimensional matrix. We want all the values in the first dimension, that is, everything inside the first pair of square brackets, and from the second dimension, which is each row here, we only need the first value; that is exactly the first column. Execute, and you can see it takes out all the values of the first column, which is our x1, so here I can substitute it in. Then I display the calculation result: it computes 28 values in total, because our training set now has 28 records, so let's take its average: enclose it in brackets, append mean, and execute. With that, the slope in the w1 direction is calculated: -295.8.
Then the calculation for w2, w3 and w4 is exactly the same, so here I just use copies: change the column index to 1, 2 and 3 to take the second, third and fourth columns, and display the execution results. W1, W2... OK, execute, and you can see these are the slopes in the directions of w1 through w4. Here it is more convenient to write this with a loop, so I recreate a w_gradient, which stores the slopes in the four w directions. First I create a matrix filled with 0s, letting it have 4 values; I display it, and it has 4 values, all 0 for now. But directly writing 4 here doesn't seem great. What the 4 means is how many features we have, because we have one w per feature. If we want to know how many features there are, can't we get it directly from x_train? To know how many columns x_train has, we can take its shape, and it tells us it has 28 rows and 4 columns in total, that is, 28 values in the first dimension and 4 values in the second. The value of the second dimension is what we want, and I can take it with square brackets 1, which gives 4. So it is better if I change the 4 to that. Then I use a loop to calculate the slopes here; it runs four times in total, so I paste the calculation here, and the i-th slope of w_gradient uses the same calculation, except the column index is changed to i. That should be no problem. I display the result after the calculation: you can see the slopes in the four directions w1 through w4 are these four values. Well, with that we have calculated the slopes in the w and b directions: this is the w direction and this is the b direction.
Next, I write the slope calculation as a function so it can be used later. I call it compute_gradient. It takes x and y, which are our real data, and then the values of w and b. Inside, we first calculate y_pred; it has already been calculated, so I paste it directly. After computing y_pred, I create a blank matrix, a matrix filled with 0s, and change this side to x, then use a for loop to calculate the slope of each w, also changing x_train to x here and here. After that we still need the slope in the b direction, so I paste it here too, also changing y_train to y. After the calculation, we return w_gradient and b_gradient. OK, execute. Then we just use it directly. The x to pass in: since this is the training phase, x and y are x_train and y_train, and for w and b I pass in the values from before. Let's execute and see: the result is exactly the same as above, for both w and b. Let's make a change: say I change b to 1 and change this to 1 2 2 4 and execute again. The calculated slopes then come out different.
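The compute_gradient function just assembled can be sketched as follows, with 2 made-up rows instead of the real 28:

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Slopes of the cost in each w direction and in the b direction
    (the constant factor 2 is folded into the learning rate)."""
    y_pred = (x * w).sum(axis=1) + b
    b_gradient = (y_pred - y).mean()
    w_gradient = np.zeros(x.shape[1])  # one slope per feature
    for i in range(x.shape[1]):
        w_gradient[i] = ((y_pred - y) * x[:, i]).mean()
    return w_gradient, b_gradient

# Toy data: 2 rows and 4 features.
x = np.array([[1.0, 0, 1, 0], [2.0, 1, 0, 1]])
y = np.array([10.0, 20.0])
w_gradient, b_gradient = compute_gradient(x, y, np.array([1, 2, 3, 4]), 0)
print(w_gradient, b_gradient)
```

The negative slopes mean the cost decreases when those parameters grow, so the update step will push them upward.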
Having calculated the slopes, let's try updating the parameters. To update w, we subtract from w the slope in the w direction, multiplied by a learning rate. Updating b is the same; just change this side to b. As for the learning rate here, let me set it casually: suppose I let it be 0.001. For the initial values of w and b I also set them like this, and for the slopes in the w and b directions we directly call the function to do the calculation, and it sends the slopes back to us. Here I display the updated w and b to see if there is any difference: you can see w was originally 1 2 2 4 and became a bit over 1.2, a bit over 2.0, a bit over 2.0 and a bit over 4.0, and b also changed. Let's check whether its cost really decreased after such an update, that is, whether it really went downhill. Here I use the compute_cost function directly: calculate it before the update and print out the result, then after the update display it again to see if it is really smaller. You can see it really is smaller: it was over 1,800 and became 1,675.
After confirming it becomes smaller, we need to repeat these actions to keep updating the values of w and b, so we write it as a function called gradient_descent. In fact, we already wrote this function in the earlier simple linear regression, and the writing is basically the same, so I copy it directly and we can execute it. After executing, we can call it directly; we have called it before, so I copy that too, but there are a few places to change. The initial values of w and b here I change to these; the learning rate I set to 0.001; and here I just let it run 10,000 times first. As for this part: because we have now split into a test set and a training set, and we are in the training phase, I write x_train and y_train. The remaining places are basically the same and don't need changing; the names compute_gradient and compute_cost are the same. Well, I execute it and you can see the error it reports. The reason for the error is here: our w and w_gradient are in numpy matrix format, and with a numpy matrix there is no way to write : .2e directly to do the format conversion, so let me delete that first and execute again. Now there should be no problem, and you can see there is none: the cost is decreasing, and the values of w and b are both being updated here.
But without the formatting, the output looks ugly and uneven. These are the slopes in the w directions and the slope in the b direction; w has w1 through w4, so there are 4 values, and w itself likewise has 4 values, but right now the printout is ugly. If we want to make it prettier, we can directly set the print format of numpy matrices: we can call np.set_printoptions to set its formatter, which is a dictionary, and say that I want floating-point numbers in this format, with colon, space and .2e written inside. Setting it this way is equivalent to applying this format to every value the matrix prints out.
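The formatter setting can be sketched like this; the array values are arbitrary examples.

```python
import numpy as np

# Print every float in a numpy array in scientific notation with 2 decimals,
# equivalent to applying "{: .2e}".format to each value.
np.set_printoptions(formatter={"float": "{: .2e}".format})
arr = np.array([1234.5678, 0.000123])
print(arr)
```

This only changes how arrays are displayed, not the stored values.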
Then let's execute it again here, and you can see it has become much prettier. Look at the cost: it has been decreasing, no problem. After running 10,000 times we can see the cost still seems to be decreasing. Here I will test what happens if I make its learning rate a bit higher. Well, it is still declining, and the decline is clearly faster than before, so there should be no problem with this learning rate. Gradually, though, the speed of the decline seems to get slower and slower. Here I will leave it set like this and just let it run 10,000 times.
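Putting the pieces together, the whole training loop can be sketched as below. Everything here is a made-up stand-in: the 40-row toy data (only the first feature is informative, the other three columns are zero), the initial parameters and the learning rate; only the shapes of compute_cost and compute_gradient follow the tutorial.

```python
import numpy as np

def compute_cost(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    return ((y - y_pred) ** 2).mean()

def compute_gradient(x, y, w, b):
    y_pred = (x * w).sum(axis=1) + b
    b_gradient = (y_pred - y).mean()
    w_gradient = np.zeros(x.shape[1])
    for i in range(x.shape[1]):
        w_gradient[i] = ((y_pred - y) * x[:, i]).mean()
    return w_gradient, b_gradient

def gradient_descent(x, y, w_init, b_init, learning_rate,
                     cost_function, gradient_function, run_iter, p_iter=1000):
    """Step w and b against the slope run_iter times; report cost every p_iter."""
    w, b = w_init, b_init
    for i in range(run_iter):
        w_gradient, b_gradient = gradient_function(x, y, w, b)
        w = w - learning_rate * w_gradient
        b = b - learning_rate * b_gradient
        if i % p_iter == 0:
            print(f"iter {i:5}: cost {cost_function(x, y, w, b): .2e}")
    return w, b

# Toy data: salary is roughly 10 + 8 * first feature; other features are idle.
rng = np.random.default_rng(87)
x_train = np.column_stack([rng.uniform(0, 10, 40), np.zeros((40, 3))])
y_train = 10 + 8 * x_train[:, 0]

w_final, b_final = gradient_descent(x_train, y_train, np.array([1.0, 2, 2, 4]),
                                    0.0, 0.001, compute_cost, compute_gradient,
                                    run_iter=10000)
```

With a moderate learning rate the printed cost shrinks steadily, matching the behavior described in the video.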
Here, everyone can go play with this themselves: set a different initial w, a different initial b, a different learning rate and a different number of iterations, and see what results you get. Here I take the w and b found at the end and display them: you can see the final w and b it found are these 5 values. Now I want to verify whether these 5 values are any good. We can use them on the test set: we take the w we finally found and multiply it by x_test, the test set; this multiplication is w1*x1, w2*x2 and so on through the third and fourth, and then we need to add them together, so I enclose it in brackets and sum with axis equal to 1, and at the end we add the b we finally found. The result calculated here is our prediction on the test set; I call it y_pred.
Then we can compare the difference between the predictions on the test set and the real values. Here displaying it in the pandas DataFrame format works better, since it gets a grid. I give it two columns: the first column is our prediction result y_pred on the test set, and the second column is the real test-set data y_test. Let's see how different they are. Execute; there is an error here, I typed an extra u, so execute again, and now it works. You can see the column on the left is the result we predicted on the test set, and the column on the right is the real data on the test set. We predicted a bit over 40 against a real 43.8, which doesn't seem far off; 67.7 against 72.7 seems a bit worse; 61.6 against 60, OK. The differences feel acceptable. Well, with that we have successfully implemented gradient descent, found the final w and b, and used that final set of w and b on the test set, with results like this.
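The side-by-side comparison can be sketched as below; w_final, b_final and the 3 test rows are made-up stand-ins for the values the tutorial actually finds.

```python
import numpy as np
import pandas as pd

# Hypothetical final parameters and a 3-row stand-in for the real test set.
w_final = np.array([9.0, 4.0, 1.0, 0.5])
b_final = 1.0
x_test = np.array([[4.6, 1, 1, 0], [6.9, 2, 1, 0], [5.1, 0, 0, 1]])
y_test = np.array([43.8, 72.7, 60.0])

# Predict on the test set, then put prediction and truth side by side.
y_pred = (x_test * w_final).sum(axis=1) + b_final
comparison = pd.DataFrame({"y_pred": y_pred, "y_test": y_test})
print(comparison)
```

Reading the two columns row by row gives a quick feel for how far off each prediction is.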
Finally, if we want to know more precisely whether these predictions are good, besides looking directly at the errors we can also calculate the cost, because the value of the cost is our criterion for judging good and bad. So here we can calculate its cost, using compute_cost directly. Now we are on the test set, so x is x_test and y is y_test, and w and b are the w_final and b_final we finally found. Let's calculate and see: its cost is a bit over 18.1. Then let's compare it with the cost during the earlier training: when we were training, the cost finally came down to 2.52*10, that is, a bit over 25.2. That seems to be no problem, because our cost on the test set is smaller than on the training set, which means its performance on the test set even seems better than on the training set. So after the cost calculation and the comparison above, if the error is acceptable to you, we can apply this model to the real situation.
Suppose someone actually came for an interview today. He tells you his seniority is 5.3 years, his degree is a master's or above, and he interviewed for city A. After the interview you think you want to hire this person, but you don't know how much salary you should give him. At this point we can use the trained model to predict how much salary we could give him. First we must do some processing on the data: because there is text here, we need to convert it into numbers first, and however we converted the data on the test set, we convert the data in the real situation the same way. I will convert directly here. The seniority 5.3 is no problem; use it directly. Master's degree or above we convert with label encoding into 2. For the workplace, we use one hot encoding to turn it into 3 features and then delete one of them, so it becomes 2 features. So here city A means the third feature is 1 and the fourth feature is 0. I use a matrix to represent this; I call it x_real, which is equal to np.array.
If you did feature scaling during training, the same applies in the real situation: we also need to do feature scaling here. How was it done? Let me look up where our feature scaling is; here is how our test set was scaled, so we scale the real-situation data the same way. OK, I'll copy it directly: x_test was scaled this way, so x_real gets the same processing. Let's display the scaled result. It reports a problem: it says only two-dimensional arrays are accepted, not one-dimensional input. I forgot to reshape; the transform only accepts two-dimensional matrices, not one-dimensional ones. So I add a pair of square brackets here to make it two-dimensional, and it can be executed again. This is the result after scaling, and now we can apply it to the model.
Our model is calculated like this, so I just copy it over here. What we want to apply here is x_real, so I change it here, then display the last value and call it y_real, and you can see the final model's prediction: it tells us we can give this person a salary of 6.55*10^1, that is, 65.5k.
If a second person comes for an interview today and tells you his seniority is 7.2 years, his education is below high school, and his workplace is in city B, and we want to predict how much salary to offer, I just add this piece of information and write another row here: 7.2, below high school converts to 0, and city B converts to 0, 1. OK, this is the second person. We run the same prediction, and you can see the model's output: you can give this person a salary of 23.4k. This is how to apply the model to the real situation.
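The two predictions above can be sketched end to end. The scaler statistics and the final parameters below are made-up stand-ins for the values the course actually trained, so the printed salaries will not match the 65.5k and 23.4k from the video:

```python
import numpy as np

# Sketch of applying the trained model to new applicants, assuming the same
# encoding as in the course: seniority as-is, label-encoded education,
# one-hot city with one column dropped. All numeric values are invented.
w_final = np.array([8.0, 4.0, 3.0, -2.0])
b_final = 10.0
mean = np.array([5.0, 1.0, 0.3, 0.3])    # pretend training-set statistics
std = np.array([2.5, 0.8, 0.46, 0.46])

# person 1: 5.3 years, master's or above (=2), city A (=1, 0)
# person 2: 7.2 years, below high school (=0), city B (=0, 1)
x_real = np.array([[5.3, 2, 1, 0],
                   [7.2, 0, 0, 1]])
x_real = (x_real - mean) / std            # scale exactly like the training set
y_real = (w_final * x_real).sum(axis=1) + b_final
print(y_real)                              # predicted salaries, in k
```

Note the double brackets in `x_real`: the array must be two-dimensional, one row per person, which is exactly the error hit in the walkthrough above.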
We have already implemented gradient descent in the previous step. In fact, in our example we can accelerate gradient descent: we only need a small technique called feature scaling to achieve the speed-up.
Well, let's take a look first. The current example has four features, namely seniority, education, and the workplaces cityA and cityB, and we use multiple linear regression to predict monthly salary, so it is written like this: y = w1*x1 + w2*x2 + w3*x3 + w4*x4 + b. Looking at the values of our four features, we can see their distributions differ: the first feature ranges between 1 and 10, the second from 0 to 2, and the third and fourth features are either 0 or 1, so their distribution ranges look like this.
We can clearly see that the distribution range of feature x1 is larger than that of the other three. Taking this back to the formula, w1 is multiplied by x1, a relatively large value, while w2, w3, and w4 are multiplied by relatively small values. This slows down our gradient descent. Why? Because w1 is multiplied by a relatively large value, even a slight change in w1 greatly affects the computed prediction, and indirectly the computed cost. In other words, as long as w1 changes slightly, the calculated cost also changes a lot.
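A tiny numeric illustration of this sensitivity, with invented feature magnitudes:

```python
# Toy numbers showing why a wide-range feature makes the cost more sensitive
# to its weight: the same weight change shifts the prediction far more when
# it is multiplied by a large feature value. The magnitudes are invented.
x1 = 9.0   # a feature ranging roughly 1-10, like seniority
x2 = 0.5   # a feature ranging 0-1, like a one-hot city column
dw = 1.0   # identical change applied to w1 and to w2
print(dw * x1, dw * x2)  # prediction shifts by 9.0 via w1 vs only 0.5 via w2
```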
So if we plot the cost against w1 and w2, it looks roughly like this. Here I use a contour map: the center point is where the cost is lowest, and the cost grows toward the outer rings; the x-axis is w1 and the y-axis is w2. We can see this contour map is long and narrow. Why? Because w1 is multiplied by a relatively large value, a slight change in w1 greatly affects the cost. We said the lowest point is at the red dot, and the cost grows toward the outer rings: you can see that a small change in w1 changes the cost a lot, whereas a change in w2 barely affects the cost. Now let's see what happens if we do gradient descent in this situation.
Suppose our initial point, the initial w1 and w2, is here. When we run the gradient descent updates, something like this is very likely to happen: the path oscillates back and forth. Why? Because, as we said, a slight change in w1 has a large impact on the cost, so when updating w1 it is easy to accidentally overshoot: one update lands over here, the next overshoots back over there, producing this back-and-forth oscillation. That makes reaching the lowest point very slow; in other words, our gradient descent becomes very slow.
So how do we solve this problem? It's very simple. The problem is caused by the features having different ranges: our first feature has a relatively large range while the others are relatively small. To fix it, we rescale the features to the same range, shrinking the large one. Once all four features fall in the same range, the contour map looks like this, and gradient descent proceeds very smoothly, heading straight for the lowest point. So to speed up gradient descent, we only need to scale every feature to the same range.
There are many ways to do feature scaling. Here I'll introduce a very classic and commonly used one called standardization. The method: take each feature value, subtract that feature's mean, and divide by that feature's standard deviation. For the seniority feature, that means subtracting the mean seniority from each seniority value and dividing by the standard deviation of seniority. It doesn't matter if you don't know what a standard deviation is, because many tools can compute it automatically. If we standardize every feature, it looks like this: you can see their ranges become very close.
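Standardization as just described can be done by hand in one line; the seniority values here are invented for illustration:

```python
import numpy as np

# Standardization: subtract the feature's mean, divide by its standard deviation.
seniority = np.array([1.0, 2.5, 4.0, 7.3, 9.8])  # made-up seniority values
scaled = (seniority - seniority.mean()) / seniority.std()
print(scaled.mean(), scaled.std())  # after standardization: mean ~0, std ~1
```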
You may wonder about the last three features: aren't they already small? Why do we still need to scale them? Scaling them as well causes no problem, and it keeps all the features in the same range. Let's implement feature scaling directly.
We do feature scaling after splitting into the training set and test set; we split the data here, so the feature scaling goes right after the split. The method we'll use here is standardization. To use it, we can import preprocessing from sklearn and bring in StandardScaler. Because it is a class, I first create an instance, which I call scaler. After creating it, we let it read the feature data of the training set. Note that it may only see the training set's features, not the test set's, because the test set data may only be used at test time.
Here we write scaler.fit and pass in x_train. Once passed in, it computes the mean, standard deviation, and so on of the features. After it has computed them, we can do the conversion: write scaler.transform and pass in x_train the same way, and it performs the transformation. We replace the original data with the transformed result and display it; this is the result after the transformation. While we're at it, I also transform x_test here; we just convert it now without using it, which will be more convenient when we run the test later, and again we directly replace it with the transformed result. Pay special attention here.
We do not compute a standard deviation or mean from the test set data and then transform with those. Instead, we directly use the result of fitting on the training set, that is, the standard deviation and mean computed from the training set, and apply them to transform the test set. I'll display it directly to have a look: this is the transformed test set.
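Putting the fit/transform steps together, here is a minimal sketch with toy data in place of the course's salary dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the course's x_train / x_test (4 features per row).
x_train = np.array([[1.2, 2, 1, 0],
                    [7.5, 0, 0, 1],
                    [4.1, 1, 0, 0],
                    [9.0, 2, 1, 0]])
x_test = np.array([[3.3, 1, 0, 1]])

scaler = StandardScaler()
scaler.fit(x_train)                  # mean/std learned from the training set only
x_train = scaler.transform(x_train)  # replace with the scaled result
x_test = scaler.transform(x_test)    # reuse the training-set statistics
print(x_train.mean(axis=0))          # each training column now has mean ~0
```

Fitting only on x_train and reusing those statistics for x_test is exactly the point stressed above: the test set must never influence the scaling.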
Now let's compare directly: after feature scaling, we run gradient descent again and see how it differs from gradient descent without feature scaling. The gradient descent we did before is here; this was its result. I'll copy it and execute it again here. All the parameters are the same; the only difference is that x_train has now undergone feature scaling. Running under the same conditions to see whether it really is faster: you can see the cost reaches about 2.5*10^1 after roughly 1,000 updates, and over the following 2,000 to 3,000 updates the cost basically doesn't move, so we can guess it has almost reached the lowest point. Compare that with the previous result, which took about 4,000 updates just to reach 2.5*10^1. So we can clearly see that after feature scaling, gradient descent indeed descends much faster.
Next, let me introduce a model that is very commonly used for classification problems, called logistic regression. First, let's understand the problem we want to solve today. Suppose whether a person has diabetes may be related to their age, weight, blood sugar, and gender, so we collect some data. Each record here represents one person's age, weight, blood sugar, gender, and whether they have diabetes, where 1 means diabetic and 0 means not. We want to use these 4 features to predict whether a person has diabetes. The output has only two possibilities, 1 or 0, diabetic or not, which is quite different from our previous two examples: in simple linear regression and multiple linear regression we predicted monthly salary, and monthly salary can take infinitely many values. Here there are only two possible outputs, either you have diabetes or you don't; when the possible outputs are limited like this, we call it a classification problem. A quick disclaimer about this classification data: I generated these records randomly, so if you have a medical background, don't take them too seriously. You might also say, isn't it simply that blood sugar above a certain level means diabetes? I don't know; let's just treat the data this way for now.
Let's look at it with a graph. This is the earlier linear regression example, where we used seniority to predict salary. The features don't have to be seniority only; they could also include workplace, education, and so on. For ease of presentation, I show only seniority. We said a straight line can represent these data. Now back to today's example: we want to use a person's blood sugar to predict whether they have diabetes. There can be many features here too, such as the weight, age, and gender just mentioned, but for the graph's sake I draw only blood sugar. We can see the distribution of the data here is very different from the left side, because the output value has only two possibilities, 1 or 0, diabetic or not, so it looks like this. If you still tried to represent these data with a straight line, that seems wrong, right? So what should we do? Actually we don't need many adjustments; just bend the line a little so it looks like this. Doesn't it now seem to represent the data? The value of this curved line can only lie between 0 and 1, which matches our data very well, since each label is either 1 or 0. So the question becomes: how do we bend the line?
It's very simple: we only need a function called the sigmoid function, in Chinese the S-shaped function, to achieve the bending effect. Taking the earlier linear regression example, the formula can be written y = w*x + b, or with many features, y = w1*x1 + w2*x2 + ... for however many features you have. To bend such a linear model, we just plug it into the sigmoid function. The sigmoid function looks like this: 1/(1 + e^(-z)). It sounds like a tongue twister, but we just take the linear model, add a minus sign, and put it in the exponent to make the bend. You may wonder what the letter e means here. This e is just a constant, 2.7182818..., like pi in mathematics; it is simply a constant. Placing the linear model in the exponent with a minus sign achieves the bending effect. Before plugging it in, I'll first rename the y here to z. OK, after substituting, our new predicted value becomes y = 1/(1 + e^-(w1*x1 + w2*x2 + ... + b)), with one term per feature. This is our logistic regression model.
Don't be frightened by how it looks; in fact, many tools can do the computation for us automatically. Back to the problem we want to solve: today we want to use age, weight, blood sugar, and gender to predict diabetes, so we have these 4 features. In other words, we want to find the best combination of w1, w2, w3, w4, and b, plug it into the sigmoid function to bend it, so that it best represents our data. Now let's implement it directly.
First, as before, we read the data in. The reading step is the same as before, so I copied it directly, but the data URL here is different, so I changed it; you can find the URL in the course files. Executing it, we can see our data is indexed 0-399, so there are 400 records. After reading it in, we process the data: there is text here, so we need to convert the text into numbers. I take the gender feature and convert it with map, mapping male to 1 and female to 0. Displaying it after the conversion, you can see the conversion is complete.
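A minimal sketch of the map conversion; the column name and the Chinese label strings are assumptions, not necessarily the course dataset's exact values:

```python
import pandas as pd

# Hypothetical gender column; the real column/label names may differ.
data = pd.DataFrame({"Gender": ["男", "女", "男"]})
# map converts each text label to a number: male -> 1, female -> 0
data["Gender"] = data["Gender"].map({"男": 1, "女": 0})
print(data["Gender"].tolist())  # [1, 0, 1]
```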
Then we split the data into a training set and a test set. The splitting method is the same as before, so I just copy it here, but the x and y need to be changed: our features are age, weight, and so on, and the y part is whether the person has diabetes. Once that's done, I display x_train and x_test; this is the split result. Then we display the y part too, y_train and y_test; no problem. Here the test set ratio is again 0.2, and since we have 400 records in total, that gives 80 records for the test set and 320 for the training set.
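The split can be sketched like this; the random_state value here is arbitrary, not necessarily the one used in the course:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the 400-record diabetes dataset (4 features, binary label).
x = np.arange(400 * 4).reshape(400, 4)
y = np.arange(400) % 2

# test_size=0.2 -> 80 test records, 320 training records
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=87)
print(len(x_train), len(x_test))  # 320 80
```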
Then, because we have 4 features whose distribution ranges differ, if we want gradient descent to run faster later, we need to do feature scaling. The feature scaling part is the same as before, so I copy it directly. We display the scaled results too: this is the result after scaling, and I display x_test as well; no problem. After
the data processing is complete, we can plug the data into the model. First I set w and b to arbitrary values. I import numpy here, and make w an array with 4 values in it, because we have 4 features: w1 through w4 I set to 1, 2, 3, 4. Then b I set to 1. Our model first multiplies w by x; since we are in the training stage, I multiply by x_train and display the result. This is the result after the multiplication. After multiplying we want to sum, same as before, so we use sum along axis 1. Executing, this is the result after the summation. After summing we add b. This whole expression is the y we predicted in multiple linear regression before, but now we're doing logistic regression, so I call it z.
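The computation of z described above, sketched with toy numbers in place of the real scaled x_train:

```python
import numpy as np

# Toy stand-in for the scaled training features (2 rows, 4 features).
x_train = np.array([[0.2, -1.1, 0.5, 1.0],
                    [1.3, 0.4, -0.7, -1.0]])
w = np.array([1, 2, 3, 4])  # arbitrary starting weights, as in the walkthrough
b = 1

# z = w1*x1 + w2*x2 + w3*x3 + w4*x4 + b, computed row by row
z = (w * x_train).sum(axis=1) + b
print(z)  # one z value per training record
```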
Then we want to plug this z into the sigmoid function. I'll write the sigmoid function first: I call it sigmoid and it takes in a z. The calculation is 1 divided by (1 plus e to some power). For e we can write np.exp, with the desired exponent inside the parentheses, and what we want is the -z power. We return 1 over that denominator, with the denominator wrapped in parentheses. I import numpy here as well and execute. Then we use it directly: pass this z in and take a look. After the conversion through this sigmoid function, every value lies between 0 and 1.
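The sigmoid function as just described:

```python
import numpy as np

# sigmoid(z) = 1 / (1 + e^(-z)); np.exp(-z) computes e to the -z power
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
out = sigmoid(z)
print(out)  # every value lies strictly between 0 and 1
```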
So we have now successfully bent this linear model, which was originally a multiple linear regression. What comes next is the same as before: we just want to find the best combination of w and b so that it best represents our data.