Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 3 - Backprop and Neural Networks
By Stanford Online
Summary
Topics Covered
- Neural Nets Demystified as Systematic Math
- Context Windows Unlock NER Accuracy
- Matrix Calculus Mirrors Single Variable
- Shape Convention Enables SGD Updates
- Backprop Reuses Derivatives Efficiently
Full Transcript
hi everyone i'll get started okay so we're now i'm back for the second week of cs224n um on natural language processing with
deep learning okay so um for today's lecture what we're going to be looking at is all the math details of doing neural net
learning first of all looking at how we can work out by hand um gradients for training neural networks and then looking at how it's done more algorithmically which is known as the
back propagation algorithm and correspondingly for you guys um well i hope you remembered that you know one minute ago was when assignment one was
due and everyone has handed that in if by some chance you haven't handed it in um really should hand it in as soon as possible best to preserve those late days for the harder assignments so i
mean i actually forgot to mention we actually did make one change um for this year to make it a bit easier when occasionally people join the class a week late if you want to this year in
the grading um assignment one can be discounted and we'll just use your other four assignments but if you've been in the class so far for that 98 percent of
people well since assignment one is the easiest assignment again it's silly not to do it and have it as part of your grade okay so
starting today we've put out assignment two and assignment two is all about making sure you really understand the math of neural networks and then the
software that we use to do that math so this is going to be a bit of a tough week for some so for some people who are great on all their math and backgrounds
um they'll feel like this is stuff they know well nothing very difficult but i know there are quite a few of you um who this lecture and week is the biggest
struggle of the course we really do want people to actually have an understanding of what goes on in your network learning rather than viewing it as some kind of deep magic
and i hope that some of the material we give today and that you read up on and use in the assignment will really give you more of a sense of what these neural
networks are doing and how it is just math that's applied in the systematic large scale that works out the answers and that this will be valuable and giving you a deeper sense of what's
going on but if this material seems very um scary and difficult you can take some refuge in the fact that there's light at the end of
the tunnel since this is really the only lecture that's heavily going through the math details of neural networks after that we'll be kind of popping back up to
a higher level and by and large after this week we'll be making use of software to do a lot of the complicated math for us
um but nevertheless i hope this is valuable i'll go through everything quickly today but if this isn't stuff that you know backwards i really do encourage you um
to you know work through it and get help as you need it so do come along to our office hours there are also a number of pieces of tutorial material given in the
syllabus so there's both the lecture notes there's some materials from cs231n um in the list of readings the very top
reading is uh some material put together by kevin clark a couple of years ago and actually that one's my favorite the presentation there fairly closely
follows the presentation in this lecture of going through matrix calculus so you know personally i'd recommend starting with that one but there are four different ones you can choose from if
one of them seems more helpful to you two other things on what's coming up um actually for thursday's lecture we make a big change and thursday's lecture is probably the most linguistic
lecture of the whole class where we go through the details of dependency grammar and dependency parsing some people find that tough as well but at least it'll be tough in a different way and then one other really good
opportunity is this friday we have our second tutorial at 10 a.m which is an introduction to pytorch which is the deep learning framework that we'll be using for the
rest of the class once we've gone through these first two assignments where you um do things by yourself um so this is a great chance to get intro to pytorch they'll be really useful for
later in the class okay um today's material is really all about sort of the math of neural networks but
just to sort of introduce a setting where we can work through this i'm going to introduce a simple nlp task and a simple form of classifier that we can
use for it so the task of named entity recognition is a very common basic nlp task and the goal of this is you're looking through pieces of text and
you're wanting to label by labeling the words which words belong to entity categories like persons locations
products dates times etc so for this piece of text last night paris hilton wowed in the sequin gown samuel quinn was arrested in the hilton hotel in
paris in april 1989 some words are being labeled as named entities as shown um these two sentences don't actually belong together in the same
article but i chose those two sentences to illustrate the basic point that it's not that you can just do this task by using a dictionary yes a dictionary is
helpful to know that paris can possibly be a location but paris can also be a person name so you have to use context to get named entity recognition right
okay well how might we do that with the neural network there are much more advanced ways of doing this but a simple yet already
pretty good way of doing um named entity recognition within a simple neural net is to say well what we're going to do is use the
word vectors that we've learned about and we're going to build up a context window of word vectors and then we're going to put those
through a neural network layer and then feed it through a softmax classifier of the kind that we um sorry i said that wrong and then we're going to feed it through a logistic
classifier of the kind that we saw when looking at negative sampling which is going to say for a particular entity type such as location
is it high probability location or is it not a high probability location so for a sentence like the museums in paris are amazing to see
what we're going to do is for each word say we're doing the word paris we're going to form a window around it say a plus or minus two word window and so for
those five words we're going to get word vectors for them from the kind of word2vec or glove word vectors we've learned and we're going to make a long
vector out of the concatenation of those five word vectors so the word of interest is in the middle and then we're going to feed this
vector to a classifier which is at the end going to have a probability of the word being a location and then we could have another classifier that says the
probability of the word being a person name and so once we've done that we're then going to run it at the next position so we then say well is the word are a location and we'd feed a window of
five words in paris are amazing to and put it through the same kind of classifier and so this is the classifier that we'll
use so it's input will be this word window so if we have d dimensional word vectors this will be a 5d vector and then we're going to put it through a
layer of a neural network so the layer of the neural network is going to multiply this vector by a matrix add on a bias vector
and then put that through a non-linearity such as the logistic transformation that we've seen before and that will give us a hidden vector
which might be of a smaller dimensionality such as this one here and so then with that hidden vector we're then going to take the dot
product of it with an extra vector here here's u so we take u dot product h and so when we do that we're
getting out a single number and that number can be any real number and so then finally we're going to put that number through
a logistic transform of the same kind that we saw when doing negative sampling the logistic transform will take any real number and it will transform it
into a probability that that word is a location so its output is the predicted probability of the word belonging to a particular class and so this could be
our location classifier which could classify each word in a window as to what the probability is that it's a location word and so this little neural network here
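as a concrete illustration, here is a small numpy sketch of this window classifier's forward pass; the dimensions (d = 4, hidden size 8) and all parameter values are made up for the sketch, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4        # toy word-vector dimensionality (hypothetical, just for the sketch)
window = 5   # center word plus/minus two words
hidden = 8   # hidden layer size (also hypothetical)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# pretend word vectors for "museums in paris are amazing"
word_vecs = [rng.standard_normal(d) for _ in range(window)]

# concatenate the five word vectors into one long 5d-dimensional input
x = np.concatenate(word_vecs)            # shape (20,)

# randomly initialized classifier parameters
W = rng.standard_normal((hidden, window * d))
b = rng.standard_normal(hidden)
u = rng.standard_normal(hidden)

z = W @ x + b      # linear layer
h = sigmoid(z)     # elementwise non-linearity gives the hidden vector
s = u @ h          # dot product with u: a single real-number score
prob = sigmoid(s)  # probability the center word ("paris") is a location
print(prob)
```

running the same forward pass with the window shifted one word to the right would score the next position, exactly as described above.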
is the neural network i'm going to use today when going through some of the math but actually i'm going to make it even easier on myself i'm going to throw away
the logistic function at the top and i'm really just going to work through the math of the bottom three quarters of this if you look at kevin clark's
handout that i just mentioned he includes when he works through it also working through the logistic function and we also saw working through
a softmax in the first lecture when i was working through some of the word2vec model okay um so the overall question we want to be
able to answer is so here's our stochastic gradient descent equation that we have existing um parameters of our model
and we want to update them based on our current loss which is the j of theta so for getting our um loss here the
true answer as to whether a word is a location or not will be either you know one if it is a location or zero if it isn't our logistic classifier will
return some number like um 0.9 and we'll use the distance away from what it should have been squared as our loss um so we work out a loss and then we're
moving a little distance in the negative of the gradient which will be in changing our parameter estimates in such a way that they reduce the loss
and so this is already being written in terms of a whole vector of parameters which is being updated as to a new vector of parameters but you can also
think about it that for each individual parameter theta j that we're working out the partial derivative of the loss with respect to that
parameter and then we're moving a little bit in the negative direction of that um that's going to give us a new value for
parameter theta j and we're going to update all of the parameters of our model as we learn i mean in particular in contrast to what
commonly happens in statistics we also we update not only the sort of parameters of our model that are sort of weights in the classifier
but we also will update our data representation so we'll also be changing our word vectors as we learn okay so to build neural nets i.e to
train neural nets based on data what we need is to be able to compute this gradient of the parameters so that we can then iteratively update the weights
of the model and efficiently train a model that has good weights i.e that has high accuracy and so how can we do that um well
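that update rule is one line of code; here is a tiny numpy sketch with made-up parameter and gradient values:

```python
import numpy as np

# current parameters theta, a hypothetical gradient of the loss, and a small step size alpha
theta = np.array([1.0, -2.0, 0.5])
grad_J = np.array([0.2, -0.4, 0.0])
alpha = 0.01

# stochastic gradient descent step: move a little in the negative gradient direction
theta_new = theta - alpha * grad_J
print(theta_new)  # each parameter is nudged in the direction that reduces the loss
```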
what i'm going to talk about today is first of all um how you can do it by hand and so for doing it by hand this is
basically a review of matrix calculus and that'll take quite a bit of the lecture and then after um we've talked about that for a
while i'll then shift gears and introduce the back propagation algorithm which is the central technology for neural networks and that technology
is essentially the efficient application of calculus on a large scale as we'll come to talking about soon
so for computing gradients by hand what we're doing is matrix calculus so we're working with vectors and matrices
and working out gradients and this can seem like pretty scary stuff and well to the extent that you're kind
of scared and don't know what's going on one choice is to work out a non-vectorized gradient by just working
out what the partial derivative is for one parameter at a time and i showed a little example of that in the first lecture but it's
much much faster and more useful to actually be able to work with vectorized gradients and in some sense if you're not very
confident this is kind of almost a leap of faith but it really is the case that multi-variable calculus is just like single variable calculus except you're
using vectors and matrices so providing you remember some basics of single variable calculus you really should be able to do this stuff and get it to work out
lots of other sources i've mentioned the notes you can also look at the textbook for math 51 which also has quite a lot of material on this
i know some of you have bad memories of math 51. um okay so let's go through this and see how it works from ramping up from the beginning so the beginning of calculus is you know we have a
function with one input and one output f of x equals x cubed and so then its gradient is its slope right so that's
its derivative so um its derivative is 3x squared and the way to think about this is how much will the output change if we change the input a little
bit right so what we're wanting to do in our neural net models is change what they output so that they do a better job of predicting the correct answers when
we're doing supervised learning and so what we want to know is if we fiddle different parameters of the model how much of an effect will that have on the output because then we can choose
how to fiddle them in the right way to move things down right so you know when we're saying that the derivative here is 3x squared well
what we're saying is that if you're at x equals one if you fiddle the input a little bit the output will change three times as much three times one squared
and it does so if i say what's the value at 1.01 it's about 1.03 it's changed three times as much and that's its slope
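you can verify these slope claims numerically; a small python sketch (the step size of 0.01 matches the "fiddle" in the lecture):

```python
def f(x):
    return x ** 3

def slope(x):
    # analytic derivative of x^3
    return 3 * x ** 2

eps = 0.01  # the small "fiddle" applied to the input
for x0 in (1.0, 4.0):
    numeric = (f(x0 + eps) - f(x0)) / eps
    print(x0, slope(x0), round(numeric, 4))  # slope 3 at x=1, slope 48 at x=4
```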
but at x equals four um the derivative is 3 times 16 which is 48 so if we fiddle the input a little it'll change 48 times as much and
that's roughly what happens 4.01 cubed is about 64.48 now of course you know this is just sort of i'm showing it for a small fiddle but you know that's an
approximation to the actual truth okay so um then we sort of ramp up to the more complex cases which are more reflective of what we do with neural
networks so um if we have a function with one output and n inputs then we have a gradient so a gradient is a vector of partial derivatives with
respect to each input so we've got n inputs x1 to xn and we're working out the partial derivative of f with respect to x1 the partial derivative of f with
respect to x2 etc and we then get a vector of partial derivatives where each element of this vector is
just like a simple derivative with respect to one variable okay so from that point we just keep on ramping up for what we do with neural
networks so commonly when we have something like a layer in a neural network we'll have a function with n inputs they'll be like our word vectors
then we do something like multiply by a matrix and then we'll have m outputs so we have a function now which is taking n inputs
and is producing m outputs so at this point um what we're calculating for the gradient is what's
called a jacobian matrix so for n inputs and m outputs the jacobian is an m by n matrix of every combination
of partial derivatives so um our function f splits up into these different sub functions f1 through
fm which generate each of the m outputs and so then we're taking the partial derivative of f1 with respect to x1 through the partial derivative of f1 with respect to xn
then heading down you know we make it up to the partial derivative of fm with respect to x1 etc so we have every possible partial derivative of an output
variable with respect to one of the input variables okay so in simple calculus
when you have a composition of one variable functions so that if you have um y equals x squared and then z equals three y
um that's then z is a composition of two functions of well you're composing two functions to get z as a function of x then you can
work out the derivative of z with respect to x and the way you do that is with the chain rule and so in the chain rule you multiply derivatives so
dz dx equals dz dy times dy dx so dz dy is just 3
and dydx is 2x so we get 3 times 2x so that overall um the derivative here is 6x and
since if we multiply this together we're really saying that z equals 3x squared um you should trivially be able to see again aha
its derivative is 6x so that works okay um so once we move into vectors and matrices and jacobians um it's actually
the same game so when we're working with those we can compose functions and work out their derivatives by simply multiplying jacobians so if we have
start with an input x and then put it through the simplest form of neural network layer and say that z equals wx plus b so we multiply that the
x vector by matrix w and then add on a bias vector b and then typically we'd put things through a non-linearity f so f could be a sigmoid function we'll
then say h equals f of z so this is the composition of two functions in terms of um vectors and matrices so we can use jacobians and we can say
the partial of h with respect to x is going to be the product um of the partial of h with respect to z and the
partial of z with respect to x and this all does work out so let's start going through some examples of how these things work
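as a quick sanity check that jacobians really do just multiply under composition, here is a small numpy sketch (toy sizes and random values, nothing from the lecture) comparing the analytic product diag(f'(z)) @ W against a finite-difference jacobian of h = f(Wx + b):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                      # h has n outputs, x has m inputs
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def h_of_x(x):
    return sigmoid(W @ x + b)    # h = f(z), z = Wx + b

# analytic jacobian via the chain rule: dh/dx = dh/dz @ dz/dx = diag(f'(z)) @ W
z = W @ x + b
fprime = sigmoid(z) * (1 - sigmoid(z))
J = np.diag(fprime) @ W

# numeric jacobian by perturbing each input coordinate a little
eps = 1e-6
J_num = np.zeros((n, m))
for j in range(m):
    dx = np.zeros(m)
    dx[j] = eps
    J_num[:, j] = (h_of_x(x + dx) - h_of_x(x)) / eps

assert np.allclose(J, J_num, atol=1e-4)
```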
slightly more concretely first just particular jacobians and then composing them together so one case we
look at is the non-linearities that we put a vector through so this is something like putting a vector through the sigmoid function f
um and so if we have an intermediate vector z and we turn it into vector h by putting it through a logistic function we can say
what is dhdz um well for this um
formally this is a function that has n inputs and n outputs so at the end of the day we're computing an n by n jacobian and
so what that's meaning is the elements of this n by n jacobian are going to take the partial derivative of each
output with respect to each input and well what is that going to be in this case well in this case because we're actually
just computing element wise a transformation such as a logistic transform of each element zi like the second equation here
if i equals j we've got something to compute whereas if i doesn't equal j
the input has no influence on the output and so the derivative is zero so if i doesn't equal j we're going to get a zero and if i does equal j then
we're going to get the regular one variable derivative of the logistic function which if i remember correctly um you were asked to
compute now i can't remember whether it's assignment one or assignment two but one of the two asks you to compute it um so our jacobian for this case looks like this
we have a diagonal matrix with the um the derivatives of each element along the diagonal and everything else is zero
okay so let's look at a couple of other jacobians um so if we are asking if we've got this w x plus b basic neural network layer and we're
asking um for the gradient with respect to x then what we're going to have coming out is that that's actually going to be the
matrix w so this is where what i hope you can do is look at the notes at home and work through
this exactly and see that this is actually the right answer but this is the way in which if you just have faith and think this is a just like single
variable calculus except i've now got vectors and matrices the answer you get is actually what you expected to get because this is just like um
the derivative of ax plus b with respect to x where it's a so similarly if we take the partial derivative with respect to b
of w x plus b we get out the identity matrix okay then one other jacobian that we mentioned um while in the first lecture
while working through word2vec is if you have the dot product of two um vectors
that's a number that you get coming out of it so the partial derivative of u t h with respect to u
is h transpose and at this point there's some fine print that i'm going to come back to in a minute
so this is the correct jacobian right because in this case um we have the dimension of h inputs and we have one output
and so we want to have a row vector um but there's a little bit more to say on that that i'll come back to in about 20 slides um but this is the correct
jacobian okay so if you are not familiar with these kind of jacobians do please look at some of the
notes that are available and try and compute these in more detail element wise and convince yourself that they really are right but i'm going to assume
these now and show you what happens when we actually then work out gradients for at least a mini little neural net okay
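before diving into the worked example, here is a small numpy sketch (hypothetical shapes, random values) that numerically spot-checks the three jacobians just claimed: W for the partial of Wx + b with respect to x, the identity for b, and h transpose for the dot product u . h:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)
u = rng.standard_normal(n)
h = rng.standard_normal(n)
eps = 1e-6

# d(Wx + b)/dx should be W
Jx = np.column_stack([
    ((W @ (x + eps * np.eye(m)[j]) + b) - (W @ x + b)) / eps for j in range(m)
])
assert np.allclose(Jx, W, atol=1e-4)

# d(Wx + b)/db should be the identity matrix
Jb = np.column_stack([
    ((W @ x + b + eps * np.eye(n)[j]) - (W @ x + b)) / eps for j in range(n)
])
assert np.allclose(Jb, np.eye(n), atol=1e-4)

# d(u . h)/du should be h transpose, i.e. a row vector with h's entries
Ju = np.array([((u + eps * np.eye(n)[j]) @ h - u @ h) / eps for j in range(n)])
assert np.allclose(Ju, h, atol=1e-4)
```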
so here is most of this um neural net i mean as i commented um that you know really we'd be working out
the partial derivative of the loss j with respect to these variables but for the example i'm doing here i've lopped that off to keep it a little simpler and more manageable for the
lecture and so we're going to just work out the partial derivative of the score s which is a real number with respect to the different parameters of this model
where the parameters of this model are going to be the w and the b and the u and also
the input because we can update the word vectors of different words based on tuning them to better predict um the
classification outputs that we desire so let's start off with a fairly easy one where we want to update um the bias vector b
to have our system um classify better so to be able to do that what we want to work out is the partial derivatives of s with respect to b so we know how to put that into our
stochastic gradient update for the b parameters okay so how do we go about doing these things so the first step is we want to
sort of break things up into different functions of minimal complexity that compose together so in particular this neural net layer h
equals f of w x plus b it's still a little bit complex so let's decompose that one further step so um we have the input x we then
calculate the linear transformation z equals wx plus b and then um we put things through
the sort of element-wise non-linearity h equals f of z and then we do the dot product with u and you know it's useful for working
these things out you know split into pieces like this have straight what your different variables are and to know what the dimensionality of each of these
variables is it's well worth just writing out the dimensionality of every variable and making sure that the answers that you're computing are of the right dimensionality
so at this point though what we can see is that calculating s
is the composition of three functions around x so for working out the partials of s with respect to b
um it's the composition of the three functions shown on the left and so therefore the gradient of s with respect to b
we're going to take um the product of these three partial derivatives okay so how do we do that
um so you know we've got s equals u t h so that's the top corresponding partial derivative then the partial derivative of h with respect to z
partial derivative of z with respect um to b which is the first one that we're working out okay so we want to work this out and if
we're lucky we remember those jacobians i showed previously about the jacobian for a vector dot product the jacobian
for the non-linearity and the jacobian for the simple linear transformation and so we can use those so for
the partials of s with respect to h well that's going to be ut using the first one
the partials of h with respect to z okay so that's the non-linearity and so that's going to be the matrix that's the diagonal matrix
with the element wise derivative f prime of z and 0 elsewhere and then for the w x plus b when we're taking the partials with
respect to b that's just the identity matrix so we can simplify that down a little the identity matrix disappears and since
u t is a vector and this is a diagonal matrix we can rewrite this as ut
hadamard product of f prime of z i think this is the first time i've used this little circle for hadamard product but it's something that you'll
see quite a bit in neural network work since it's often used so when we have two vectors ut and this vector here sometimes you
want to do an element-wise product so the output of this will be a vector where you've taken the first element of each and multiplied them then the second element of each and
multiplied them etc downwards and so that's called the hadamard product and it's what we're calculating here to get a vector
which is the gradient of s with respect to b okay so that's good so we now have a gradient of
s with respect to b and we could use that in our stochastic gradient but we don't stop there we also want to work out the gradient
with respect to others of our parameters so we might want to next go on and work out the gradient of s with respect to w
well we can use the chain rule just like we did before right so we've got the same product of functions and everything is going to be the same apart from me
now taking um the derivatives with respect to w rather than b um so it's now going to be um the
partial of s with respect to h h with respect to z and z with respect to w and the important thing to notice here
and this leads into what we do with the backpropagation algorithm is wait a minute this is very similar to what we've
already done so when we were working out the gradients of s with respect to b the first two terms were exactly the same it's only the last one that differs
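the reuse being described here can be made concrete: compute the shared first two terms once as delta, and since dz/db is the identity, the gradient with respect to b is then just delta itself. a small numpy sketch with made-up sizes, checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 6
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x = rng.standard_normal(m)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def score(b):
    # s = u . f(Wx + b), the score from the lecture's mini neural net
    return u @ sigmoid(W @ x + b)

# the shared first two terms: delta = ds/dh * dh/dz = u hadamard f'(z)
z = W @ x + b
delta = u * sigmoid(z) * (1 - sigmoid(z))

# for b the last term dz/db is the identity, so ds/db is just delta
eps = 1e-6
for j in range(n):
    db = eps * np.eye(n)[j]
    numeric = (score(b + db) - score(b)) / eps
    assert abs(numeric - delta[j]) < 1e-4
```

the same delta can then be reused when computing the gradient with respect to w, which is exactly the saving backpropagation exploits.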
so to be able to build um or to train neural networks efficiently this is what happens all the time and
it's absolutely essential that we use an algorithm that avoids repeated computation and so the idea we're going to develop is when
we have this equation stack there's sort of stuff that's above um where we compute z and that'll be the same each time and we want to
compute something from that that we can then sort of feed downwards when working out the gradients with respect to w x or b
and so we do that by defining delta which is the composition of the partials that are above
the linear transform and that's referred to as the local error signal it's what's being passed in from above to the linear transform and we've
already computed the gradient of that in the preceding slides and so the final form of the
partial of s with respect to b will be um delta times the remaining part and well we've
seen that you know for um partial of s with respect to b um the partial of z with respect to b is just the identity so the end result was
delta but in this time we're then going to have to work out the partial of z with respect to w and multiply that by delta so that's the part that we still
haven't yet done so um and this is where things get in some sense a little bit hairier
and so there's something that's important to explain um so you know what should we have for the jacobian of um dsdw
well that's a function that has one output the output is just a score a real number and then it has n by
m inputs so the jacobian is um a 1 by nm matrix i.e a very long row vector
but um that's correct math but it turns out that that's kind of bad for our neural networks because remember what we want to do with
our neural networks is do stochastic gradient descent and we want to say theta new equals theta old minus
a small multiplier times the gradient and well actually the w matrix is an n by m matrix
and so we couldn't actually do the subtraction if this gradient we calculate is just a huge row vector we'd like to have it as the same shape
as the w matrix in neural network land when we do this um we depart from pure math at this point and we use what we call the shape
convention so what we're going to say is um and you're meant to use this for answers in the assignment that the shape of the
gradient we're always going to make to be the shape of the parameters and so therefore um
dsdw we're also going to represent as an n by m matrix just like w and we're going to reshape
the jacobian to place it into this matrix shape okay so if we want to place it into this matrix shape
what are we going to want to get for dsdw well
we know that it's going to involve delta our local error signal and then we have to work out something for
dz dw um well since z equals wx plus b you'd kind of expect that the answer should be x
um and that's right so the answer um to dsdw is um going to be
delta transpose times x transpose and so the form that we're getting for this derivative is going to be the product of
the local error signal that comes in from above and what we calculate from the local input x
now that may not yet be obvious why that is true so let me just go through in a bit more detail why that's true so when we want to work out um
d s d w right it's sort of delta times dz dw where um what that's computing for z is wx plus b
so let's just consider for a moment what the derivative is with respect to a single weight w i j
so w i j might be w 2 3 that's shown in my little neural network here and so the first thing to notice is that w i j
only contributes to zi so it's going into z2 which then computes h2 and it has no
effect whatsoever on h1 okay so when we're working out um dzi dw i j it's going to be
the derivative of w i dot x that row of the matrix times x plus b i which means that
we've got a kind of a sum of w i k times x k and then for this sum this is like one variable calculus that when we're taking the derivative of this with respect to w
i j every term in this sum is going to be zero the derivative is going to be zero except for the one that involves w i j
and then the derivative of that is just like ax with respect to a it's going to be x so you get x j out as the answer
and so the end result of that is that when we're working out what we want is the answer is that we're going to um get
that these columns where x1 is all that's left x2 is all that's left through xm is all
that's left and then that's multiplied by the vectors of the local error signal from above and what we want to compute
is this outer product matrix where we're getting the different combinations of the delta and the x and so we can get the n by m
matrix that we'd like to have by our shape convention by taking delta transpose which is n by 1 times x
transpose which is then 1 by m and then we get this outer product matrix um so like that's the kind of a hacky argument that i've made it's certainly a way of
doing things that the dimensions work out and it sort of makes sense um there's a more detailed run through this that appears in lecture notes um and i
encourage you to sort of also look at the more mathy version of that here's a little bit more information about um the shape convention so
well first of all one more example of this so when you're working out dsdb
its jacobian comes out as a row vector um but you know according to the shape convention we want
our gradient to be the same shape as b and b is a column vector so that's sort of again they're different shapes and you have to transpose one to get the other
and so effectively what we have is a disagreement between the jacobian form so the jacobian form makes sense for you know calculus and math because if
you want to have it like i claimed that matrix calculus is just like single variable calculus apart from using vectors and matrices you can just
multiply together the partials that only works out if you're using jacobians but on the other hand if you want to do stochastic gradient descent
and be able to sort of subtract off a piece of the gradient that only works if you have the same shape matrix for the
gradient as you do for the original matrix and so this is a bit confusing but that's just the reality there are both of these um
two things so the jacobian form is useful in doing the um calculus but for the answers in the assignment we
want the answers um to be presented using the shape convention so that the gradient is shown in the same shape as
the parameters and therefore you'll be able to it's the right shape for doing a gradient update by just subtracting a small amount of the gradient
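as a concrete illustration of the shape convention, here is a minimal numpy sketch with made-up sizes and values (not code from the lecture):

```python
import numpy as np

n, m = 3, 4                          # hidden size and input size (made up)
delta = np.array([0.1, -0.2, 0.3])   # local error signal, shape (n,)
x = np.array([1.0, 2.0, 3.0, 4.0])   # input vector, shape (m,)
W = np.zeros((n, m))                 # parameter matrix, shape (n, m)

# shape-convention gradient for W: the outer product delta^T x^T,
# which is n x 1 times 1 x m, giving the same (n, m) shape as W
grad_W = np.outer(delta, x)

# because the shapes match, the SGD update is a single subtraction
lr = 0.01
W -= lr * grad_W
```

the point of the convention is exactly that last line: the gradient has the parameter's shape, so the update is one element-wise subtraction.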
so for working through things there are then basically two choices one choice is to work through all the
math using jacobians and then right at the end um to reshape following the shape convention to give the answer so
that's what i did when i worked out ds/db we worked through it using jacobians we got an answer
but it turned out to be a row vector and so well then we have to transpose it at the end to get it into the right shape for the shape convention um
the alternative is um to always follow the shape convention um and that's kind of what i did when i
was then working out ds/dW i didn't fully use jacobians i said oh well when we work out dz/dW let's work out what
shape we want it to be and what to fill in the cells with and if you're sort of trying to do it um immediately with the shape convention
it's a little bit more hacky in a way since you know you have to look at the dimensions for what you want and figure out when to transpose or to
reshape the matrix to be at the right shape but the kind of informal reasoning that i gave is what you do and what works and you
know one way of and there are sort of hints that you can use right that you know that your gradient should always be the same shape as your parameters and you know that the
error signal coming in will always have the same dimensionality as that hidden layer and you can sort of work it out always following the shape convention
okay um so that is it for doing all this
matrix calculus so after pausing for breath for a second the rest of the lecture
is then okay let's look at how our software trains neural networks using what's
referred to as the back propagation algorithm um so the short answer is you know
basically we've already done it the rest of the lecture is easy um so you know essentially i've just shown you what the back propagation algorithm does
um so the back propagation algorithm is judiciously taking and propagating
derivatives using the matrix chain rule the rest of the back propagation algorithm is to say
okay when we have these neural networks we have a lot of shared structure and shared derivatives so what we want to do
is maximally efficiently reuse derivatives of higher layers when we're computing derivatives for lower layers so that we
minimize computation and i already pointed that out in the first half but we want to systematically exploit that and so the way we do that
in our computational systems is they construct computation graphs um so this maybe looks a little bit like
what you saw in a compilers class if you did one right so i call it here a computation graph but it's really a tree right so you're
creating here this tree of computations in this case but in more general case it's some kind of directed graph of computations
which has source nodes which are inputs either inputs like x or input parameters like w and b
and its interior nodes are operations and so then once we've constructed a graph and so this graph corresponds to exactly the example i did before right this was our little neural net that's in
the top right and here's the corresponding computation graph of computing w x plus b put it through the sigmoid non-linearity f
take the dot product of the resulting vector with u and that gives us our output score s um
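here is a minimal numpy sketch of that forward computation (sizes and values are made up, and a sigmoid is assumed for the non-linearity f):

```python
import numpy as np

# sigmoid non-linearity, applied element-wise
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input vector
W = rng.standard_normal((3, 4))   # weight matrix
b = rng.standard_normal(3)        # bias vector
u = rng.standard_normal(3)        # output weight vector

# pass values along the edges of the graph: W x, then z, then h, then s
z = W @ x + b
h = sigmoid(z)
s = u @ h                         # output score s, a scalar
```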
what we do to compute this is we pass along the edges the results of operations so this is w x then z then h and then our output is s
and so the first thing we want to be able to do to compute with neural networks is to be able to compute for different inputs what the output is and
so that's referred to as forward propagation and so we simply run this expression much like
you'd standardly do in a compiler to compute the value of s and that's the forward propagation phase but the essential additional element of neural
networks is that we then also want to be able to send back gradients which will tell us how to update the parameters of
the model and so it's this ability um to send back gradients which gives us the ability for these models to learn once we have a loss function at the end we
can work out how to change the parameters of the model so that they more accurately produce the desired output i.e. they
minimize the loss and so it's doing that part that then is called back propagation so we then once we forward propagated a
value with our current parameters we then um head backwards reversing the direction of the arrows and pass along gradients
down to the different parameters like b and w and u that we can use to change with stochastic gradient descent what
the value of b is and what the value of w is so we start off with ds/ds which is just one and then we
run our back propagation and we're using the sort of same kind of composition of jacobians so we have ds/dh here and
ds/dz and we progressively pass back those gradients so we just need to work out how to efficiently and cleanly do this in a
computational system and so let's sort of work through again a few of these cases so the general situation is um we have
a particular node so a node is where some kind of operation like multiplication or a non-linearity happens and so the
simplest case is that we've got one output and one input so we'll do that first so that's like h equals f of z so what we have
is an upstream gradient um ds/dh and what we want to do is compute the downstream gradient
ds/dz and the way we're going to do that is say well for this function f it's a function
it's got a derivative a gradient so what we want to do is work out that local gradient dh/dz and then
that gives us everything that we need to work out ds/dz because that's precisely where we're going to use the chain rule we're going to say that ds/dz equals
the product of ds/dh times dh/dz where this is again using jacobians okay so the general principle that we're going to use is the downstream gradient
equals the upstream gradient times the local gradient okay sometimes it gets a little bit more complicated so we might have multiple
inputs to a function so this is the matrix vector multiply so z equals wx okay when there are multiple inputs
we still have an upstream gradient ds/dz but what we're going to do is work out a local gradient with respect to each
input so we have dz/dW and dz/dx and so then at that point it's exactly the same for each piece of it we're going to work
out the downstream gradients ds/dW and ds/dx by using the chain rule with respect to the particular local gradient
so um let's go through an example of this i mean this is kind of a silly example it's not really an example that looks like a typical neural net but it's
sort of a simple example where we can show some of the components of what we do so what we're going to do is want to calculate f of x y z
which is being calculated as x plus y times the max of y and z um and we've got you know particular values that we're starting off with x
equals one y equals two and z equals zero so these are the current values of our parameters and so we can say okay well we want to
build an expression tree for that here's our expression tree we're taking x plus y we're taking the max of y and z and then we're multiplying them
and so our forward propagation phase is just to run this so we take the values of our parameters and we simply start to compute with them right so we
have one two two zero um and we add them as three the max is two we multiply them and that gives us six okay so then at that point
we then want to go and work out um how to do things um for back propagation and how these back propagation steps work
and so the first part of that is sort of working out what our local gradients are going to be um so
um so this is a here and this is x and y so da/dx since a equals x plus y is just going to be one and da/dy is also going to be one
um then um for b equals the max of y and z um so this is this max node so the local
gradients for that are going to depend on whether y is greater than z so db/dy is going to be
one if and only if y is greater than z which it is at our particular point here so that's one and db/dz is going to be one only if
z is greater than y so for our particular values here that one is going to be zero
um and then finally here we're calculating the product f equals a times b um so for the product um the derivative of f
with respect to a is equal to b which is two and the derivative of f with respect to b is a equals three so that gives us all of the local gradients
at each node and so then to run backpropagation we start with df/df which is just one and then we're going to
work out the downstream equals the upstream times the local okay so the local
so when you have a product like this um note that sort of the gradients flip so we take upstream times the local which is 2
so the downstream is 2 on this side df/db is three so we're taking upstream times local
that gives us three um and so that gives us back propagates values to um the plus and max nodes and so then we continue along
so for the max node um the local gradient db/dy equals one so upstream is three so
we then take three times one and that gives us three db/dz is zero because of the fact that z's value is not the max um so we're
taking three times zero and saying the gradient there is zero so finally doing the plus node um the local gradients for
both x and y there are one so we're just getting two times one in both cases and we're saying that the gradients there are two
okay and so again at the end of the day um the interpretation here is that this is giving is this information as to if
we wiggle the values of x y and z how much of a difference does it make to the output what is the slope the gradient
with respect to the variable so what we've seen is that since z isn't the max of y and z
if i change the value of z a little like if i make z 0.1 or minus 0.1 it makes no difference at all to what i compute as the output so therefore the gradient
there is zero if i change the value of x a little
then that is going to have an effect and it's going to affect the output by twice as much as the amount i change it
right so and that's because um df/dx equals two um so interestingly
we can basically work that out so if we make x 1.1
we then get the max here is 2 and we get 1.1 plus 2 as 3.1 so we get 3.1 times
2 so that's 6.2 so changing x by 0.1 has added 0.2 to the value of f
um conversely for the value of y we find that df/dy equals 5 so what we do when we've got two things coming out here as i'll go through again
in a moment is we're summing the gradients so again three plus two equals five and empirically that's what happens so if we consider fiddling the value of
y a little let's say we make it a value of 2.1 then the prediction is it'll have five times as big an effect on the
output value we compute and well what do we compute so we compute 1 plus 2.1 so that's 3.1 and we
compute the max of um 2.1 and 0 as 2.1 so we'll take the product of 2.1 and 3.1 and i calculate
that in advance since i can't really do this arithmetic in my head and the product of those two is 6.51 so it has gone up by about
0.5 so fiddling y by 0.1 has been multiplied by five to give the magnitude of the effect on the output okay so for this stuff
you know before i did the case of you know when we had one in and one out
here and multiple inputs and one out here the case that i hadn't actually dealt with
is the case of when you have multiple outward branches but that then turned up in the computation of y so once you have multiple outward
branches what you're doing is your summing so that when you want to work out the
df/dy you've got a local gradient you've got two upstream gradients
and you're working it out with respect to each of them as in the chain rule and then you're summing them together to work out the impact at the end
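the whole worked example can be written out in a few lines of plain python, a sketch following the lecture's numbers, with the branch-summing for y included:

```python
# forward and backward pass for f(x, y, z) = (x + y) * max(y, z)
# at the lecture's values x = 1, y = 2, z = 0
x, y, z = 1.0, 2.0, 0.0

# forward pass
a = x + y         # 3
b = max(y, z)     # 2
f = a * b         # 6

# backward pass: downstream = upstream * local
df_df = 1.0
df_da = df_df * b                            # multiply gate swaps: local is b
df_db = df_df * a                            # local is a
df_dx = df_da * 1.0                          # plus gate distributes
# y has two outward branches, so its gradients sum: 2 + 3 = 5
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)
df_dz = df_db * (1.0 if z > y else 0.0)      # max gate routes; z loses here
```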
right so we also saw some of the other node intuitions which it's useful to have um doing this so when you have an addition
um that distributes the upstream gradient to each of the things below it when you have max it's like a routing
node so when you have max you have the upstream gradient and it goes to one of the branches below it and the rest of them get no gradient
um when you then have a multiplication it has this effect of switching the gradient so if you're taking three by two
um the gradient on the two side is three and on the three side is two and if you think about in terms of how much effect you get from when you're doing this sort
of wiggling that totally makes sense right because if you're multiplying another number by three then any change here is going to be
multiplied by 3 and vice versa okay so that so this is the kind of computation graph that we want to use to work out
derivatives in an automated computational fashion um which is the basis of the back propagation algorithm but at that point that you know this is
what we're doing but there's still you know one mistake that we can make it would be wrong for us to sort of say okay well first of all we want to work out ds/db
so look we can start up here we can propagate our upstream errors work out local gradients upstream error local gradient and keep
all the way down and get ds/db down here okay next we want to do it
for ds/dW let's just run it all over again because if we did that we'd be doing repeated computation as i showed in the first half that this term
is the same both times this term is the same both times this term is the same both times that only the bits at the end differ so what we want to do
is avoid duplicated computation and compute all the gradients um that we're going to need um successively
so that we only do them once and so that was analogous to when i introduced this delta variable when we computed gradients by hand
so starting off here with ds/ds which is one
we then want to one time compute gradient in the green here one time compute the gradient in green here that's all common work
then we're going to take the local gradient um for dz/db and multiply that by the upstream
gradient to work out ds/db and then we're going to take the same upstream gradient and then um work out the local gradient here
um and then sort of propagate that down to give us ds/dW so the end result is we want to sort of systematically
work to forward computation forward in the graph and backward computation back propagation backward in the graph in a way that we do things
efficiently so this is the general form of the algorithm which works for an arbitrary
computation graph so at the end of the day we've got a single scalar output z
and then we have inputs and parameters which compute z and so once we have this computation
graph and i added in this funky extra arrow here to make it a more general computation graph well we can always say
that we can work out a starting point something that doesn't depend on anything so in this case both of these bottom two nodes don't depend on
anything else so we can start with them and we can start to compute forward we can compute values for all of these sort of second row from the bottom nodes and then we're
able to compute um the third level up so we can have a topological sort of the nodes based on the dependencies in this
directed graph and we can compute the value of each node given some subset of its predecessors which it depends on and so
doing that is referred to as the forward propagation phase and gives us a computation of the scalar output z using our current parameters and our
current inputs and so then after that we run back propagation so for back propagation we initialize the output gradient
dz/dz as one and then we visit nodes in the reverse order of the topological sort
and we compute the gradients downward and so our recipe is that for each node as we head down we're going to
compute the gradient of the node with respect to its successors the things that it feeds into and how we compute that gradient is
using this chain rule that we've looked at so this is sort of the generalized form of the chain rule where we have multiple outputs and so we're summing
over the different outputs and then for each output we're computing the product of the upstream gradient and the local gradient with respect to that node and so we head
downwards and we continue down in the reverse topological sort order and we work out um the gradient with respect to each variable
in this graph and so it hopefully looks um kind of intuitive looking at this picture that
if you think of it like this the big-o complexity of forward propagation and backward propagation is the same right in both cases you're doing a linear pass
through all of these nodes and calculating values given predecessors and then values given successors i mean you have to do a little bit more work
um for working out the gradients sort of as shown by this chain rule but it's the same big-o complexity so if somehow you're implementing stuff for yourself rather than relying on the
software and you're calculating the gradients with a different order of complexity than forward propagation it means that you're doing something wrong you're doing repeated work that you
shouldn't have to do okay so this algorithm works for a completely arbitrary computation graph any directed acyclic
graph you can apply this algorithm in general what we find is that we build neural networks that have a regular layer structure so we have things like a
vector of inputs and then that's multiplied by a matrix it's transformed into another vector which might be multiplied by another matrix or summed with another matrix or something
right so once we're using that kind of regular layer structure we can then parallelize the computation by working out the
gradients in terms of jacobians of vectors and matrices and do things in parallel much more efficiently okay
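the general algorithm above can be sketched in a few lines of python — a toy Node class, made up here for illustration, that records its parents and local gradients and then runs backward in reverse topological order:

```python
# minimal sketch of reverse-mode autodiff on a computation graph;
# each node stores its parents and the local gradient for each parent
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this one depends on
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0                 # accumulated gradient of output w.r.t. self

def add(u, v):
    return Node(u.value + v.value, (u, v), (1.0, 1.0))

def mul(u, v):
    # multiply gate: local gradients swap the inputs
    return Node(u.value * v.value, (u, v), (v.value, u.value))

def backward(output):
    # topological sort of the graph rooted at the output
    order, seen = [], set()
    def topo(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                topo(p)
            order.append(n)
    topo(output)
    # visit nodes in reverse topological order, summing over outputs
    output.grad = 1.0
    for n in reversed(order):
        for p, local in zip(n.parents, n.local_grads):
            p.grad += n.grad * local    # downstream += upstream * local

x, y = Node(1.0), Node(2.0)
f = mul(add(x, y), y)                   # f = (x + y) * y
backward(f)
```

running it gives the summed gradient for y automatically, since y feeds both the add node and the mul node.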
so doing this is then referred to as automatic differentiation and so essentially um if you know the computation graph
you should be able to have your compute clever computer system work out um what the derivatives of everything is and
then apply back propagation um to work out how to update the parameters and learn and there's actually a sort of an interesting um
sort of thing of how history has gone backwards here which i'll just note um so some of you might be
um familiar with symbolic um computation packages so those are things like mathematica so mathematica you can
give it a symbolic form of a computation and then it can work out derivatives for you so it should be the case that if you give a complete symbolic form of a
computation graph um then it should be able to work out all the derivatives for you and you never have to work out a derivative by hand whatsoever
and that was actually attempted in a famous um deep learning library called theano which came out of yoshua bengio's group at the university of montreal
that it had a compiler that did that kind of symbolic manipulation um but you know somehow that sort of proved um
a little bit too hard a road to follow i imagine it actually might come back again in the future and so for
modern deep learning frameworks which includes both tensorflow and pytorch they do 90 percent
of um this computation of automatic differentiation for you but they don't actually symbolically compute derivatives so for each
particular node or layer of your deep learning system somebody either you or the person who wrote that
layer has hand-written the local derivatives but then everything from that point on the sort of the taking doing the chain rule of combining
upstream gradients with local gradients to work out downstream gradients that's then all being done automatically for back propagation on the computation graph
and so that what that means is for a whole neural network you have a computation graph and it's going to have a forward pass and a backward pass
and so for the forward pass you're topologically sorting the nodes based on their dependencies in the computation graph and then for each
node you're running forward the forward computation on that node and then for backward propagation you are reversing the topological sort of the
graph and then for each node in the graph you're running the backward propagation which is the little bit of backdrop the chain rule at that node and
then the result of doing that is you have gradients for your inputs and parameters and so
this is the overall software runs this for you and so what you want to do is then actually have stuff
for particular nodes or layers in the graph so if i have a multiply gate it's going to have a forward algorithm which just computes that the
output is x times y in terms of the two inputs and then i'm going to want to compute to tell it also how to calculate the local derivative so i want to say
what is the local derivative so dL/dx and dL/dy in terms of the upstream gradient dL/dz and so
i will then manually work out how to calculate that and normally what i have to do is i assume the forward pass is being run
first and i'm going to shove into some local variables for my class the values that were used in the forward computation so as well as computing z
equals x times y i'm going to sort of remember what x and y were so that then when i'm asked to compute
the backward pass i'm then going to have implemented here um what we saw earlier of um that when it's x y you're going to
sort of swap the y and the x um to work out the local gradients and so then i'm going to multiply those by the upstream gradient and i'm going to return
i've just written it here as a sort of a little list but really it's going to be a numpy vector of the gradients okay um
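that multiply gate might be sketched like this in python (the class and method names are made up for illustration, not from any particular framework):

```python
# multiply gate: forward caches its inputs, backward swaps them to form
# the local gradients and multiplies each by the upstream gradient
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y      # remember the inputs for the backward pass
        return x * y

    def backward(self, dL_dz):
        dL_dx = dL_dz * self.y     # local gradient of z = x*y w.r.t. x is y
        dL_dy = dL_dz * self.x     # and w.r.t. y is x -- the "swap"
        return [dL_dx, dL_dy]

gate = MultiplyGate()
z = gate.forward(3.0, 2.0)         # forward pass must run first
grads = gate.backward(1.0)         # then backward with upstream gradient 1
```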
so that's um 98 percent of what i wanted to cover um today just um a couple of quick comments um left so
um that can and should all be automated sometimes you want to just check if you're computing the right gradients and so the standard way of checking that you're
computing the right gradients is to manually work out the gradient by doing a numeric calculation of the gradient and so
um you can do that so you can work out what the derivative of f with respect to x should be by choosing some sort of small number
like 10 to the minus 4 adding it to x subtracting it from x and then so the difference between these numbers is 2h dividing it through by 2h
and you're simply working out the rise over the run which is the slope of that point with respect to x and that's an approximation of the gradient of f with
respect to x at that value of x so this is so simple you can't make a mistake implementing that and so therefore you can use this
to check whether your gradient values are correct or not um this isn't something that you'd want to use much um because not only is it
approximate but it's extremely slow because to work this out you have to run the forward computation for every parameter of the model so if you have a model with a million parameters you're
now doing a million times as much work as you would to run back prop using calculus so calculus is a good thing to
know but it can be really useful to check that the right values are being calculated in the old days when we hand wrote everything this was kind of the key unit test that
people used everywhere these days most of the time you're reusing layers that are built into pytorch or some other deep learning framework so it's much less
needed but sometimes you're implementing your own layer and you really do want to check that things are implemented correctly there's a fine point in the way this is
written if you saw this in sort of high school calculus class you would have seen rise over run of f
of x plus h minus f of x divided by h it turns out that doing this two-sided estimate like this is much much more
accurate than doing a one-sided estimate and so you're really much encouraged to use this approximation okay so at that point um we've mastered
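the two-sided estimate just described can be sketched in a couple of lines (a hypothetical helper, not library code):

```python
# two-sided ("central difference") numeric gradient check:
# (f(x + h) - f(x - h)) / (2h), the rise over the run on both sides
def numeric_grad(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 3               # analytic derivative is 3 x^2
approx = numeric_grad(f, 2.0)      # should be very close to 12
```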
the core technology of neural nets um backpropagation is recursively and hence efficiently applying the chain rule along the computation graph with this
sort of key step that the downstream gradient equals the upstream gradient times the local gradient and so for calculating with neural nets we do
the forward pass to work out values with current parameters then run back propagation and work out the gradient
of the loss and currently computed loss with respect to those parameters now to some extent um you know with modern deep learning frameworks you don't actually have to
know how to do any of this right it's like it's the same as you don't have to know how to implement a c compiler you can just write c code and say gcc and
it'll compile it and it'll run um the right stuff for you um and that's the kind of functionality you get from the pytorch framework so do come along to
the pytorch tutorial this friday and get a sense of how easy it is to write neural networks um using a framework like pytorch or tensorflow and you know
it's so easy that's why you know high school students across the nation are now doing their science projects training deep learning systems because you don't actually have to
understand very much you just plug a few neural network layers together and set it computing on some data but you know we hope in this class that you actually are also learning how these
things are implemented um so you have a deeper understanding than that and you know it turns out that sometimes you need to have a deeper
understanding so back propagation doesn't always work perfectly and so understanding what it's really doing can be crucial to debugging things
and so we'll actually see an example of that fairly soon when we start looking at recurrent models and some of the problems that they have which will require us to think a bit more deeply
about what's happening in our gradient computations okay that's it for today