Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 3 - Backprop and Neural Networks
By Stanford Online
Summary
Topics Covered
- Neural Nets Demystified as Systematic Math
- Context Windows Unlock NER Accuracy
- Matrix Calculus Mirrors Single Variable
- Shape Convention Enables SGD Updates
- Backprop Reuses Derivatives Efficiently
Full Transcript
hi everyone i'll get started okay so we're now i'm back for the second week of cs224n um on natural language processing with
deep learning okay so um for today's lecture what we're going to be looking at is all the math details of doing neural net
learning first of all looking at how we can work out by hand um gradients for training neural networks and then looking at how it's done more algorithmically which is known as the
back propagation algorithm and correspondingly for you guys um well i hope you remembered that you know one minute ago was when assignment one was
due and everyone has handed that in if by some chance you haven't handed it in um really should hand it in as soon as possible best to preserve those late days for the harder assignments so i
mean i actually forgot to mention we actually did make one change um for this year to make it a bit easier when occasionally people join the class a week late if you want to this year in
the grading um assignment one can be discounted and we'll just use your other four assignments but if you've been in the class so far for that 98 percent of
people well since assignment one is the easiest assignment again it's silly not to do it and have it as part of your grade okay so
starting today we've put out assignment two and assignment two is all about making sure you really understand the math of neural networks and then the
software that we use to do that math so this is going to be a bit of a tough week for some so for some people who are great on all their math and backgrounds
um they'll feel like this is stuff they know well nothing very difficult but i know there are quite a few of you um who this lecture and week is the biggest
struggle of the course we really do want people to actually have an understanding of what goes on in your network learning rather than viewing it as some kind of deep magic
and i hope that some of the material we give today and that you read up on and use in the assignment will really give you more of a sense of what these neural
networks are doing and how it is just math that's applied in the systematic large scale that works out the answers and that this will be valuable and giving you a deeper sense of what's
going on but if this material seems very um scary and difficult you can take some refuge in the fact that there's light at the end of
the tunnel since this is really the only lecture that's heavily going through the math details of neural networks after that we'll be kind of popping back up to
a higher level and by and large after this week we'll be making use of software to do a lot of the complicated math for us
um but nevertheless i hope this is valuable i'll go through everything quickly today but if this isn't stuff that you know backwards i really do encourage you um
to you know work through it and get help as you need it so do come along to our office hours there are also a number of pieces of tutorial material given in the
syllabus so there's both the lecture notes there's some materials from cs231n um in the list of readings the very top
reading is uh some material put together by kevin clark a couple of years ago and actually that one's my favorite the presentation there fairly closely
follows the presentation in this lecture of going through matrix calculus so you know personally i'd recommend starting with that one but there are four different ones you can choose from if
one of them seems more helpful to you two other things on what's coming up um actually for thursday's lecture we make a big change and thursday's lecture is probably the most linguistic
lecture of the whole class where we go through the details of dependency grammar and dependency parsing some people find that tough as well but at least it'll be tough in a different way and then one other really good
opportunity is this friday we have our second tutorial at 10 a.m which is an introduction to pytorch which is the deep learning framework that we'll be using for the
rest of the class once we've gone through these first two assignments where you um do things by yourself um so this is a great chance to get intro to pytorch they'll be really useful for
later in the class okay um today's material is really all about sort of the math of neural networks but
just to sort of introduce a setting where we can work through this i'm going to introduce a simple nlp task and a simple form of classifier that we can
use for it so the task of named entity recognition is a very common basic nlp task and the goal of this is you're looking through pieces of text and
you're wanting to label by labeling the words which words belong to entity categories like persons locations
products dates times etc so for this piece of text last night paris hilton wowed in the sequin gown samuel quinn was arrested in the hilton hotel in
paris in april 1989 some words are being labeled as named entities as shown um these two sentences don't actually belong together in the same
article but i chose those two sentences to illustrate the basic point that it's not that you can just do this task by using a dictionary yes a dictionary is
helpful to know that paris can possibly be a location but paris can also be a person name so you have to use context to get named entity recognition right
okay well how might we do that with the neural network there are much more advanced ways of doing this but a simple yet already
pretty good way of doing um named entity recognition within a simple neural net is to say well what we're going to do is use the
word vectors that we've learned about and we're going to build up a context window of word vectors and then we're going to put those
through a neural network layer and then feed it through a softmax classifier of the kind that we um sorry i said that wrong and then we're going to feed it through a logistic
classifier of the kind that we saw when looking at negative sampling which is going to say for a particular entity type such as location
is it high probability location or is it not a high probability location so for a sentence like the museums in paris are amazing to see
what we're going to do is for each word say we're doing the word paris we're going to form a window around it say a plus or minus two word window and so for
those five words we're going to get word vectors for them from the kind of word2vec or glove word vectors we've learned and we're going to make a long
vector out of the concatenation of those five word vectors so the word of interest is in the middle and then we're going to feed this
vector to a classifier which is at the end going to have a probability of the word being a location and then we could have another classifier that says the
probability of the word being a person name and so once we've done that we're then going to run it at the next position so we then say well is the word are a location and we'd feed a window of
five words in paris are amazing to and put it through the same kind of classifier and so this is the classifier that we'll
use so it's input will be this word window so if we have d dimensional word vectors this will be a 5d vector and then we're going to put it through a
layer of a neural network so the layer of the neural network is going to multiply this vector by a matrix add on a bias vector
and then put that through a non-linearity such as the logistic transformation that we've seen before and that will give us a hidden vector
which might be of a smaller dimensionality such as this one here and so then with that hidden vector we're then going to take the dot
product of it with an extra vector here here's u so we take u dot product h and so when we do that we're
getting out a single number and that number can be any real number and so then finally we're going to put that number through
a logistic transform of the same kind that we saw when doing negative sampling the logistic transform will take any real number and it will transform it
into a probability that that word is a location so its output is the predicted probability of the word belonging to a particular class and so this could be
our location classifier which could classify each word in a window as to what the probability is that it's a location word and so this little neural network here
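as a concrete illustration, here is a small numpy sketch of this window classifier's forward pass; the dimensions (d = 4, hidden size 8) and all parameter values are made up for the sketch, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4        # toy word-vector dimensionality (hypothetical, just for the sketch)
window = 5   # center word plus/minus two words
hidden = 8   # hidden layer size (also hypothetical)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# pretend word vectors for "museums in paris are amazing"
word_vecs = [rng.standard_normal(d) for _ in range(window)]

# concatenate the five word vectors into one long 5d-dimensional input
x = np.concatenate(word_vecs)            # shape (20,)

# randomly initialized classifier parameters
W = rng.standard_normal((hidden, window * d))
b = rng.standard_normal(hidden)
u = rng.standard_normal(hidden)

z = W @ x + b      # linear layer
h = sigmoid(z)     # elementwise non-linearity gives the hidden vector
s = u @ h          # dot product with u: a single real-number score
prob = sigmoid(s)  # probability the center word ("paris") is a location
print(prob)
```

running the same forward pass with the window shifted one word to the right would score the next position, exactly as described above.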
is the neural network i'm going to use today when going through some of the math but actually i'm going to make it even easier on myself i'm going to throw away
the logistic function at the top and i'm really just going to work through the math of the bottom three quarters of this if you look at kevin clark's
handout that i just mentioned he includes when he works through it also working through the logistic function and we also saw working through
a softmax in the first lecture when i was working through some of the word2vec model okay um so the overall question we want to be
able to answer is so here's our stochastic gradient descent equation that we have existing um parameters of our model
and we want to update them based on our current loss which is the j of theta so for getting our um loss here the
true answer as to whether a word is a location or not will be either you know one if it is a location or zero if it isn't our logistic classifier will
return some number like um 0.9 and we'll use the distance away from what it should have been squared as our loss um so we work out a loss and then we're
moving a little distance in the negative of the gradient which will be in changing our parameter estimates in such a way that they reduce the loss
and so this is already being written in terms of a whole vector of parameters which is being updated as to a new vector of parameters but you can also
think about it that for each individual parameter theta j that we're working out the partial derivative of the loss with respect to that
parameter and then we're moving a little bit in the negative direction of that um that's going to give us a new value for
parameter theta j and we're going to update all of the parameters of our model as we learn i mean in particular in contrast to what
commonly happens in statistics we also we update not only the sort of parameters of our model that are sort of weights in the classifier
but we also will update our data representation so we'll also be changing our word vectors as we learn okay so to build neural nets i.e to
train neural nets based on data what we need is to be able to compute this gradient of the parameters so that we can then iteratively update the weights
of the model and efficiently train a model that has good weights i.e that has high accuracy and so how can we do that um well
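that update rule is one line of code; here is a tiny numpy sketch with made-up parameter and gradient values:

```python
import numpy as np

# current parameters theta, a hypothetical gradient of the loss, and a small step size alpha
theta = np.array([1.0, -2.0, 0.5])
grad_J = np.array([0.2, -0.4, 0.0])
alpha = 0.01

# stochastic gradient descent step: move a little in the negative gradient direction
theta_new = theta - alpha * grad_J
print(theta_new)  # each parameter is nudged in the direction that reduces the loss
```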
what i'm going to talk about today is first of all um how you can do it by hand and so for doing it by hand this is
basically a review of matrix calculus and that'll take quite a bit of the lecture and then after um we've talked about that for a
while i'll then shift gears and introduce the back propagation algorithm which is the central technology for neural networks and that technology
is essentially the efficient application of calculus on a large scale as we'll come to talking about soon
so for computing gradients by hand what we're doing is matrix calculus so we're working with vectors and matrices
and working out gradients and this can seem like pretty scary stuff and well to the extent that you're kind
of scared and don't know what's going on one choice is to work out a non-vectorized gradient by just working
out what the partial derivative is for one parameter at a time and i showed a little example of that in the first lecture but it's
much much faster and more useful to actually be able to work with vectorized gradients and in some sense if you're not very
confident this is kind of almost a leap of faith but it really is the case that multi-variable calculus is just like single variable calculus except you're
using vectors and matrices so providing you remember some basics of single variable calculus you really should be able to do this stuff and get it to work out
lots of other sources i've mentioned the notes you can also look at the textbook for math 51 which also has quite a lot of material on this
i know some of you have bad memories of math 51. um okay so let's go through this and see how it works from ramping up from the beginning so the beginning of calculus is you know we have a
function with one input and one output f of x equals x cubed and so then its gradient is its slope right so that's
its derivative so um its derivative is 3x squared and the way to think about this is how much will the output change if we change the input a little
bit right so what we're wanting to do in our neural net models is change what they output so that they do a better job of predicting the correct answers when
we're doing supervised learning and so what we want to know is if we fiddle different parameters of the model how much of an effect will that have on the output because then we can choose
how to fiddle them in the right way to move things down right so you know when we're saying that the derivative here is 3x squared well
what we're saying is that if you're at x equals one if you fiddle the input a little bit the output will change three times as much three times one squared
and it does so if i say what's the value at 1.01 it's about 1.03 it's changed three times as much and that's its slope
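you can verify these slope claims numerically; a small python sketch (the step size of 0.01 matches the "fiddle" in the lecture):

```python
def f(x):
    return x ** 3

def slope(x):
    # analytic derivative of x^3
    return 3 * x ** 2

eps = 0.01  # the small "fiddle" applied to the input
for x0 in (1.0, 4.0):
    numeric = (f(x0 + eps) - f(x0)) / eps
    print(x0, slope(x0), round(numeric, 4))  # slope 3 at x=1, slope 48 at x=4
```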
but at x equals four um the derivative is 3 times 16 which is 48 so if we fiddle the input a little it'll change 48 times as much and
that's roughly what happens 4.01 cubed is about 64.48 now of course you know this is just sort of i'm showing it for a small fiddle but you know that's an
approximation to the actual truth okay so um then we sort of ramp up to the more complex cases which are more reflective of what we do with neural
networks so um if we have a function with one output and n inputs then we have a gradient so a gradient is a vector of partial derivatives with
respect to each input so we've got n inputs x1 to xn and we're working out the partial derivative of f with respect to x1 the partial derivative of f with
respect to x2 etc and we then get a vector of partial derivatives where each element of this vector is
just like a simple derivative with respect to one variable okay so from that point we just keep on ramping up for what we do with neural
networks so commonly when we have something like a layer in a neural network we'll have a function with n inputs they'll be like our word vectors
then we do something like multiply by a matrix and then we'll have m outputs so we have a function now which is taking n inputs
and is producing m outputs so at this point um what we're calculating for the gradient is what's
called a jacobian matrix so for n inputs and m outputs the jacobian is an m by n matrix of every combination
of partial derivatives so um our function f splits up into these different sub functions f1 through
fm which generate each of the m outputs and so then we're taking the partial derivative of f1 with respect to x1 through the partial derivative of f1 with respect to xn
then heading down you know we make it up to the partial derivative of fm with respect to x1 etc so we have every possible partial derivative of an output
variable with respect to one of the input variables okay so in simple calculus
when you have a composition of one variable functions so that if you have um y equals x squared and then z equals three y
um that's then z is a composition of two functions of well you're composing two functions to get z as a function of x then you can
work out the derivative of z with respect to x and the way you do that is with the chain rule and so in the chain rule you multiply derivatives so
dz dx equals dz dy times dy dx so dz dy is just 3
and dydx is 2x so we get 3 times 2x so that overall um the derivative here is 6x and
since if we multiply this together we're really saying that z equals 3x squared um you should trivially be able to see again aha
its derivative is 6x so that works okay um so once we move into vectors and matrices and jacobians um it's actually
the same game so when we're working with those we can compose functions and work out their derivatives by simply multiplying jacobians so if we have
start with an input x and then put it through the simplest form of neural network layer and say that z equals wx plus b so we multiply that the
x vector by matrix w and then add on a bias vector b and then typically we'd put things through a non-linearity f so f could be a sigmoid function we'll
then say h equals f of z so this is the composition of two functions in terms of um vectors and matrices so we can use jacobians and we can say
the partial of h with respect to x is going to be the product um of the partial of h with respect to z and the
partial of z with respect to x and this all does work out so let's start going through some examples of how these things work
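as a quick sanity check that jacobians really do just multiply under composition, here is a small numpy sketch (toy sizes and random values, nothing from the lecture) comparing the analytic product diag(f'(z)) @ W against a finite-difference jacobian of h = f(Wx + b):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                      # h has n outputs, x has m inputs
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def h_of_x(x):
    return sigmoid(W @ x + b)    # h = f(z), z = Wx + b

# analytic jacobian via the chain rule: dh/dx = dh/dz @ dz/dx = diag(f'(z)) @ W
z = W @ x + b
fprime = sigmoid(z) * (1 - sigmoid(z))
J = np.diag(fprime) @ W

# numeric jacobian by perturbing each input coordinate a little
eps = 1e-6
J_num = np.zeros((n, m))
for j in range(m):
    dx = np.zeros(m)
    dx[j] = eps
    J_num[:, j] = (h_of_x(x + dx) - h_of_x(x)) / eps

assert np.allclose(J, J_num, atol=1e-4)
```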
slightly more concretely first just particular jacobians and then composing them together so one case we
look at is the non-linearities that we put a vector through so this is something like putting a vector through the sigmoid function f
um and so if we have an intermediate vector z and we turn it into vector h by putting it through a logistic function we can say
what is dhdz um well for this um
formally this is a function that has n inputs and n outputs so at the end of the day we're computing an n by n jacobian and
so what that's meaning is the elements of this n by n jacobian are going to take the partial derivative of each
output with respect to each input and well what is that going to be in this case well in this case because we're actually
just computing element wise a transformation such as a logistic transform of each element zi like the second equation here
if i equals j we've got something to compute whereas if i doesn't equal j
the input has no influence on the output and so the derivative is zero so if i doesn't equal j we're going to get a zero and if i does equal j then
we're going to get the regular one variable derivative of the logistic function which if i remember correctly um you were asked to
compute now i can't remember whether it's assignment one or assignment two but one of the two asks you to compute it um so our jacobian for this case looks like this
we have a diagonal matrix with the um the derivatives of each element along the diagonal and everything else is zero
okay so let's look at a couple of other jacobians um so if we are asking if we've got this w x plus b basic neural network layer and we're
asking um for the gradient with respect to x then what we're going to have coming out is that that's actually going to be the
matrix w so this is where what i hope you can do is look at the notes at home and work through
this exactly and see that this is actually the right answer but this is the way in which if you just have faith and think this is a just like single
variable calculus except i've now got vectors and matrices the answer you get is actually what you expected to get because this is just like um
the derivative of ax plus b with respect to x where it's a so similarly if we take the partial derivative with respect to b
of w x plus b we get out the identity matrix okay then one other jacobian that we mentioned um while in the first lecture
while working through word2vec is if you have the dot product of two um vectors
that's a number that you get coming out of it so the partial derivative of u t h with respect to u
is h transpose and at this point there's some fine print that i'm going to come back to in a minute
so this is the correct jacobian right because in this case um we have the dimension of h inputs and we have one output
and so we want to have a row vector um but there's a little bit more to say on that that i'll come back to in about 20 slides um but this is the correct
jacobian okay so if you are not familiar with these kind of jacobians do please look at some of the
notes that are available and try and compute these in more detail element wise and convince yourself that they really are right but i'm going to assume
these now and show you what happens when we actually then work out gradients for at least a mini little neural net okay
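before diving into the worked example, here is a small numpy sketch (hypothetical shapes, random values) that numerically spot-checks the three jacobians just claimed: W for the partial of Wx + b with respect to x, the identity for b, and h transpose for the dot product u . h:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)
u = rng.standard_normal(n)
h = rng.standard_normal(n)
eps = 1e-6

# d(Wx + b)/dx should be W
Jx = np.column_stack([
    ((W @ (x + eps * np.eye(m)[j]) + b) - (W @ x + b)) / eps for j in range(m)
])
assert np.allclose(Jx, W, atol=1e-4)

# d(Wx + b)/db should be the identity matrix
Jb = np.column_stack([
    ((W @ x + b + eps * np.eye(n)[j]) - (W @ x + b)) / eps for j in range(n)
])
assert np.allclose(Jb, np.eye(n), atol=1e-4)

# d(u . h)/du should be h transpose, i.e. a row vector with h's entries
Ju = np.array([((u + eps * np.eye(n)[j]) @ h - u @ h) / eps for j in range(n)])
assert np.allclose(Ju, h, atol=1e-4)
```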
so here is most of this um neural net i mean as i commented um that you know really we'd be working out
the partial derivative of the loss j with respect to these variables but for the example i'm doing here i've lopped that off to keep it a little simpler and more manageable for the
lecture and so we're going to just work out the partial derivative of the score s which is a real number with respect to the different parameters of this model
where the parameters of this model are going to be the w and the b and the u and also
the input because we can update the word vectors of different words based on tuning them to better predict um the
classification outputs that we desire so let's start off with a fairly easy one where we want to update um the bias vector b
to have our system um classify better so to be able to do that what we want to work out is the partial derivatives of s with respect to b so we know how to put that into our
stochastic gradient update for the b parameters okay so how do we go about doing these things so the first step is we want to
sort of break things up into different functions of minimal complexity that compose together so in particular this neural net layer h
equals f of w x plus b it's still a little bit complex so let's decompose that one further step so um we have the input x we then
calculate the linear transformation z equals wx plus b and then um we put things through
the sort of element-wise non-linearity h equals f of z and then we do the dot product with u and you know it's useful for working
these things out you know split into pieces like this have straight what your different variables are and to know what the dimensionality of each of these
variables is it's well worth just writing out the dimensionality of every variable and making sure that the answers that you're computing are of the right dimensionality
so at this point though what we can see is that calculating s
is the composition of three functions around x so for working out the partials of s with respect to b
um it's the composition of the three functions shown on the left and so therefore the gradient of s with respect to b
we're going to take um the product of these three partial derivatives okay so how do we do that
um so you know we've got s equals u t h so that's the top corresponding partial derivative then the partial derivative of h with respect to z
partial derivative of z with respect um to b which is the first one that we're working out okay so we want to work this out and if
we're lucky we remember those jacobians i showed previously about the jacobian for a vector dot product the jacobian
for the non-linearity and the jacobian for the simple linear transformation and so we can use those so for
the partials of s with respect to h well that's going to be ut using the first one
the partials of h with respect to z okay so that's the non-linearity and so that's going to be the matrix that's the diagonal matrix
with the element wise derivative f prime of z and 0 elsewhere and then for the w x plus b when we're taking the partials with
respect to b that's just the identity matrix so we can simplify that down a little the identity matrix disappears and since
u t is a vector and this is a diagonal matrix we can rewrite this as ut
hadamard product of f prime of z i think this is the first time i've used this little circle for hadamard product but it's something that you'll
see quite a bit in neural network work since it's often used so when we have two vectors ut and this vector here sometimes you
want to do an element-wise product so the output of this will be a vector where you've taken the first element of each and multiplied them then the second element of each and
multiplied them etc downwards and so that's called the hadamard product and it's what we're calculating here to get a vector
which is the gradient of s with respect to b okay so that's good so we now have a gradient of
s with respect to b and we could use that in our stochastic gradient but we don't stop there we also want to work out the gradient
with respect to others of our parameters so we might want to next go on and work out the gradient of s with respect to w
well we can use the chain rule just like we did before right so we've got the same product of functions and everything is going to be the same apart from me
now taking um the derivatives with respect to w rather than b um so it's now going to be um the
partial of s with respect to h h with respect to z and z with respect to w and the important thing to notice here
and this leads into what we do with the backpropagation algorithm is wait a minute this is very similar to what we've
already done so when we were working out the gradients of s with respect to b the first two terms were exactly the same it's only the last one that differs
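the reuse being described here can be made concrete: compute the shared first two terms once as delta, and since dz/db is the identity, the gradient with respect to b is then just delta itself. a small numpy sketch with made-up sizes, checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 6
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x = rng.standard_normal(m)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def score(b):
    # s = u . f(Wx + b), the score from the lecture's mini neural net
    return u @ sigmoid(W @ x + b)

# the shared first two terms: delta = ds/dh * dh/dz = u hadamard f'(z)
z = W @ x + b
delta = u * sigmoid(z) * (1 - sigmoid(z))

# for b the last term dz/db is the identity, so ds/db is just delta
eps = 1e-6
for j in range(n):
    db = eps * np.eye(n)[j]
    numeric = (score(b + db) - score(b)) / eps
    assert abs(numeric - delta[j]) < 1e-4
```

the same delta can then be reused when computing the gradient with respect to w, which is exactly the saving backpropagation exploits.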
so to be able to build um or to train neural networks efficiently this is what happens all the time and
it's absolutely essential that we use an algorithm that avoids repeated computation and so the idea we're going to develop is when
we have this equation stack there's sort of stuff that's above um where we compute z and that'll be the same each time and we want to
compute something from that that we can then sort of feed downwards when working out the gradients with respect to w x or b
and so we do that by defining delta which is the composition of the partials that are above
the linear transform and that's referred to as the local error signal it's what's being passed in from above to the linear transform and we've
already computed the gradient of that in the preceding slides and so the final form of the
partial of s with respect to b will be um delta times the remaining part and well we've
seen that you know for um partial of s with respect to b um the partial of z with respect to b is just the identity so the end result was
delta but in this time we're then going to have to work out the partial of z with respect to w and multiply that by delta so that's the part that we still
haven't yet done so um and this is where things get in some sense a little bit hairier
and so there's something that's important to explain um so you know what should we have for the jacobian of um dsdw
well that's a function that has one output the output is just a score a real number and then it has n by
m inputs so the jacobian is um a 1 by nm matrix i.e a very long row vector
but um that's correct math but it turns out that that's kind of bad for our neural networks because remember what we want to do with
our neural networks is do stochastic gradient descent and we want to say theta new equals theta old minus
a small multiplier times the gradient and well actually the w matrix is an n by m matrix
and so we couldn't actually do the subtraction if this gradient we calculate is just a huge row vector we'd like to have it as the same shape
as the w matrix in neural network land when we do this um we depart from pure math at this point and we use what we call the shape
convention so what we're going to say is um and you're meant to use this for answers in the assignment that the shape of the
gradient we're always going to make to be the shape of the parameters and so therefore um
dsdw we're also going to represent as an n by m matrix just like w and we're going to reshape
the jacobian to place it into this matrix shape okay so if we want to place it into this matrix shape
what are we going to want to get for dsdw well
we know that it's going to involve delta our local error signal and then we have to work out something for
dz dw um well since z equals wx plus b you'd kind of expect that the answer should be x
um and that's right so the answer um to dsdw is um going to be
delta transpose times x transpose and so the form that we're getting for this derivative is going to be the product of
the local error signal that comes in from above and what we calculate from the local input x
now that may not yet be obvious why that is true so let me just go through in a bit more detail why that's true so when we want to work out um
d s d w right it's sort of delta times dz dw where um what that's computing for z is wx plus b
so let's just consider for a moment what the derivative is with respect to a single weight w i j
so w i j might be w 2 3 that's shown in my little neural network here and so the first thing to notice is that w i j
only contributes to zi so it's going into z2 which then computes h2 and it has no
effect whatsoever on h1 okay so when we're working out um dzi dw i j it's going to be
the derivative of w i dot x that row of the matrix times x plus b i which means that
we've got a kind of a sum of w i k times x k and then for this sum this is like one variable calculus that when we're taking the derivative of this with respect to w
i j every term in this sum is going to be zero the derivative is going to be zero except for the one that involves w i j
and then the derivative of that is just like ax with respect to a it's going to be x so you get x j out as the answer
and so the end result of that is that when we're working out what we want is the answer is that we're going to um get
that these columns where x1 is all that's left x2 is all that's left through xm is all
that's left and then that's multiplied by the vectors of the local error signal from above and what we want to compute
is this outer product matrix where we're getting the different combinations of the delta and the x and so we can get the n by m
matrix that we'd like to have by our shape convention by taking delta transpose which is n by 1 times x
transpose which is then 1 by m and then we get this outer product matrix um so like that's the kind of a hacky argument that i've made it's certainly a way of
doing things that the dimensions work out and it sort of makes sense um there's a more detailed run through this that appears in lecture notes um and i
encourage you to sort of also look at the more mathy version of that here's a little bit more information about um the shape convention so
well first of all one more example of this so when you're working out dsdb
its jacobian comes out as a row vector um but you know according to the shape convention we want
our gradient to be the same shape as b and b is a column vector so that's sort of again they're different shapes and you have to transpose one to get the other
and so effectively what we have is a disagreement between the jacobian form so the jacobian form makes sense for you know calculus and math because if
you want to have it like i claimed that matrix calculus is just like single variable calculus apart from using vectors and matrices you can just
multiply together the partials that only works out if you're using jacobians but on the other hand if you want to do stochastic gradient descent
and be able to sort of subtract off a piece of the gradient that only works if you have the same shape matrix for the
gradient as you do for the original matrix and so this is a bit confusing but that's just the reality there are both of these um
two things so the jacobian form is useful in doing the um calculus but for the answers in the assignment we
want the answers um to be presented using the shape convention so that the gradient is shown in the same shape as
the parameters and therefore you'll be able to it's the right shape for doing a gradient update by just subtracting a small amount of the gradient
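as a concrete illustration of the shape convention, here is a minimal numpy sketch with made-up sizes and values (not code from the lecture):

```python
import numpy as np

n, m = 3, 4                          # hidden size and input size (made up)
delta = np.array([0.1, -0.2, 0.3])   # local error signal, shape (n,)
x = np.array([1.0, 2.0, 3.0, 4.0])   # input vector, shape (m,)
W = np.zeros((n, m))                 # parameter matrix, shape (n, m)

# shape-convention gradient for W: the outer product delta^T x^T,
# which is n x 1 times 1 x m, giving the same (n, m) shape as W
grad_W = np.outer(delta, x)

# because the shapes match, the SGD update is a single subtraction
lr = 0.01
W -= lr * grad_W
```

the point of the convention is exactly that last line: the gradient has the parameter's shape, so the update is one element-wise subtraction.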
so for working through things there are then basically two choices one choice is to work through all the
math using jacobians and then right at the end um to reshape following the shape convention to give the answer so
that's what i did when i worked out ds/db we worked through it using jacobians we got an answer
but it turned out to be a row vector and so well then we have to transpose it at the end to get it into the right shape for the shape convention um
the alternative is um to always follow the shape convention um and that's kind of what i did when i
was then working out ds/dW i didn't fully use jacobians i said oh well when we work out dz/dW let's work out what
shape we want it to be and what to fill in the cells with and if you're sort of trying to do it um immediately with the shape convention
it's a little bit more hacky in a way since you know you have to look at the dimensions for what you want and figure out when to transpose or to
reshape the matrix to be at the right shape but the kind of informal reasoning that i gave is what you do and what works and you
know one way of and there are sort of hints that you can use right that you know that your gradient should always be the same shape as your parameters and you know that the
error signal coming in will always have the same dimensionality as that hidden layer and you can sort of work it out always following the shape convention
okay um so that is it for doing all this
matrix calculus so after pausing for breath for a second the rest of the lecture
is then okay let's look at how our software trains neural networks using what's
referred to as the back propagation algorithm um so the short answer is you know
basically we've already done it the rest of the lecture is easy um so you know essentially i've just shown you what the back propagation algorithm does
um so the back propagation algorithm is judiciously taking and propagating
derivatives using the matrix chain rule the rest of the back propagation algorithm is to say
okay when we have these neural networks we have a lot of shared structure and shared derivatives so what we want to do
is maximally efficiently reuse derivatives of higher layers when we're computing derivatives for lower layers so that we
minimize computation and i already pointed that out in the first half but we want to systematically exploit that and so the way we do that
in our computational systems is they construct computation graphs um so this maybe looks a little bit like
what you saw in a compilers class if you did one right so i call it here a computation graph but it's really a tree right so you're
creating here this tree of computations in this case but in more general case it's some kind of directed graph of computations
which has source nodes which are inputs either inputs like x or input parameters like w and b
and its interior nodes are operations and so then once we've constructed a graph and so this graph corresponds to exactly the example i did before right this was our little neural net that's in
the top right and here's the corresponding computation graph of computing w x plus b put it through the sigmoid non-linearity f
take the dot product of the resulting vector with u and that gives us our output score s um
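here is a minimal numpy sketch of that forward computation (sizes and values are made up, and a sigmoid is assumed for the non-linearity f):

```python
import numpy as np

# sigmoid non-linearity, applied element-wise
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input vector
W = rng.standard_normal((3, 4))   # weight matrix
b = rng.standard_normal(3)        # bias vector
u = rng.standard_normal(3)        # output weight vector

# pass values along the edges of the graph: W x, then z, then h, then s
z = W @ x + b
h = sigmoid(z)
s = u @ h                         # output score s, a scalar
```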
what we do to compute this is we pass along the edges the results of operations so this is w x then z then h and then our output is s
and so the first thing we want to be able to do to compute with neural networks is to be able to compute for different inputs what the output is and
so that's referred to as forward propagation and so we simply run this expression much like
you'd standardly do in a compiler to compute the value of s and that's the forward propagation phase but the essential additional element of neural
networks is that we then also want to be able to send back gradients which will tell us how to update the parameters of
the model and so it's this ability um to send back gradients which gives us the ability for these models to learn once we have a loss function at the end we
can work out how to change the parameters of the model so that they more accurately produce the desired output i.e. they
minimize the loss and so it's doing that part that then is called back propagation so we then once we forward propagated a
value with our current parameters we then um head backwards reversing the direction of the arrows and pass along gradients
down to the different parameters like b and w and u that we can use to change with stochastic gradient descent what
the value of b is and what the value of w is so we start off with ds/ds which is just one and then we
run our back propagation and we're using the sort of same kind of composition of jacobians so we have ds/dh here and
ds/dz and we progressively pass back those gradients so we just need to work out how to efficiently and cleanly do this in a
computational system and so let's sort of work through again a few of these cases so the general situation is um we have
a particular node so a node is where some kind of operation like multiplication or a non-linearity happens and so the
simplest case is that we've got one output and one input so we'll do that first so that's like h equals f of z so what we have
is an upstream gradient um ds/dh and what we want to do is compute the downstream gradient
ds/dz and the way we're going to do that is say well for this function f it's a function
it's got a derivative a gradient so what we want to do is work out that local gradient dh/dz and then
that gives us everything that we need to work out ds/dz because that's precisely where we're going to use the chain rule we're going to say that ds/dz equals
the product of ds/dh times dh/dz where this is again using jacobians okay so the general principle that we're going to use is the downstream gradient
equals the upstream gradient times the local gradient okay sometimes it gets a little bit more complicated so we might have multiple
inputs to a function so this is the matrix vector multiply so z equals wx okay when there are multiple inputs
we still have an upstream gradient ds/dz but what we're going to do is work out a local gradient with respect to each
input so we have dz/dW and dz/dx and so then at that point it's exactly the same for each piece of it we're going to work
out the downstream gradients ds/dW and ds/dx by using the chain rule with respect to the particular local gradient
so um let's go through an example of this i mean this is kind of a silly example it's not really an example that looks like a typical neural net but it's
sort of a simple example where we can show some of the components of what we do so what we're going to do is want to calculate f of x y z
which is being calculated as x plus y times the max of y and z um and we've got you know particular values that we're starting off with x
equals one y equals two and z equals zero so these are the current values of our parameters and so we can say okay well we want to
build an expression tree for that here's our expression tree we're taking x plus y we're taking the max of y and z and then we're multiplying them
and so our forward propagation phase is just to run this so we take the values of our parameters and we simply start to compute with them right so we
have one two two zero um and we add them as three the max is two we multiply them and that gives us six okay so then at that point
we then want to go and work out um how to do things um for back propagation and how these back propagation steps work
and so the first part of that is sort of working out what our local gradients are going to be um so
um so this is a here and this is x and y so da/dx since a equals x plus y is just going to be one and da/dy is also going to be one
um then um for b equals the max of y and z um so this is this max node so the local
gradients for that are going to depend on whether y is greater than z so db/dy is going to be
one if and only if y is greater than z which it is at our particular point here so that's one and db/dz is going to be one only if
z is greater than y so for our particular values here that one is going to be zero
um and then finally here we're calculating the product f equals a times b um so for the product um the derivative of f
with respect to a is equal to b which is two and the derivative of f with respect to b is a equals three so that gives us all of the local gradients
at each node and so then to run backpropagation we start with df/df which is just one and then we're going to
work out the downstream equals the upstream times the local okay so the local
so when you have a product like this um note that sort of the gradients flip so we take upstream times the local which is 2
so the downstream is 2 on this side df/db is three so we're taking upstream times local
that gives us three um and so that gives us back propagates values to um the plus and max nodes and so then we continue along
so for the max node um the local gradient db/dy equals one so upstream is three so
we then take three times one and that gives us three db/dz is zero because of the fact that z's value is not the max um so we're
taking three times zero and saying the gradient there is zero so finally doing the plus node um the local gradients for
both x and y there are one so we're just getting two times one in both cases and we're saying that the gradients there are two
okay and so again at the end of the day um the interpretation here is that this is giving is this information as to if
we wiggle the values of x y and z how much of a difference does it make to the output what is the slope the gradient
with respect to the variable so what we've seen is that since z isn't the max of y and z
if i change the value of z a little like if i make z 0.1 or minus 0.1 it makes no difference at all to what i compute as the output so therefore the gradient
there is zero if i change the value of x a little
then that is going to have an effect and it's going to affect the output by twice as much as the amount i change it
right so and that's because um df/dx equals two um so interestingly
we can basically work that out so if we make x 1.1
we then get the max here is 2 and we get 1.1 plus 2 as 3.1 so we get 3.1 times
2 so that's 6.2 so changing x by 0.1 has added 0.2 to the value of f
um conversely for the value of y we find that df/dy equals 5 so what we do when we've got two things coming out here as i'll go through again
in a moment is we're summing the gradients so again three plus two equals five and empirically that's what happens so if we consider fiddling the value of
y a little let's say we make it a value of 2.1 then the prediction is it'll have five times as big an effect on the
output value we compute and well what do we compute so we compute 1 plus 2.1 so that's 3.1 and we
compute the max of um 2.1 and 0 as 2.1 so we'll take the product of 2.1 and 3.1 and i calculate
that in advance since i can't really do this arithmetic in my head and the product of those two is 6.51 so it has gone up by about
0.5 so fiddling y by 0.1 has been multiplied by five to give the magnitude of the effect on the output okay so for this stuff
you know before i did the case of you know when we had one in and one out
here and multiple inputs and one out here the case that i hadn't actually dealt with
is the case of when you have multiple outward branches but that then turned up in the computation of y so once you have multiple outward
branches what you're doing is your summing so that when you want to work out the
df/dy you've got a local gradient you've got two upstream gradients
and you're working it out with respect to each of them as in the chain rule and then you're summing them together to work out the impact at the end
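the whole worked example can be written out in a few lines of plain python, a sketch following the lecture's numbers, with the branch-summing for y included:

```python
# forward and backward pass for f(x, y, z) = (x + y) * max(y, z)
# at the lecture's values x = 1, y = 2, z = 0
x, y, z = 1.0, 2.0, 0.0

# forward pass
a = x + y         # 3
b = max(y, z)     # 2
f = a * b         # 6

# backward pass: downstream = upstream * local
df_df = 1.0
df_da = df_df * b                            # multiply gate swaps: local is b
df_db = df_df * a                            # local is a
df_dx = df_da * 1.0                          # plus gate distributes
# y has two outward branches, so its gradients sum: 2 + 3 = 5
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)
df_dz = df_db * (1.0 if z > y else 0.0)      # max gate routes; z loses here
```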
right so we also saw some of the other node intuitions which it's useful to have um doing this so when you have an addition
um that distributes the upstream gradient to each of the things below it when you have max it's like a routing
node so when you have max you have the upstream gradient and it goes to one of the branches below it and the rest of them get no gradient
um when you then have a multiplication it has this effect of switching the gradient so if you're taking three by two
um the gradient on the two side is three and on the three side is two and if you think about in terms of how much effect you get from when you're doing this sort
of wiggling that totally makes sense right because if you're multiplying another number by three then any change here is going to be
multiplied by 3 and vice versa okay so that so this is the kind of computation graph that we want to use to work out
derivatives in an automated computational fashion um which is the basis of the back propagation algorithm but at that point that you know this is
what we're doing but there's still you know one mistake that we can make it would be wrong for us to sort of say okay well first of all we want to work out ds/db
so look we can start up here we can propagate our upstream errors work out local gradients upstream error local gradient and keep
all the way down and get ds/db down here okay next we want to do it
for ds/dW let's just run it all over again because if we did that we'd be doing repeated computation as i showed in the first half that this term
is the same both times this term is the same both times this term is the same both times that only the bits at the end differ so what we want to do
is avoid duplicated computation and compute all the gradients um that we're going to need um successively
so that we only do them once and so that was analogous to when i introduced this delta variable when we computed gradients by hand
so starting off here with ds/ds which is one
we then want to one time compute gradient in the green here one time compute the gradient in green here that's all common work
then we're going to take the local gradient um for dz/db and multiply that by the upstream
gradient to work out ds/db and then we're going to take the same upstream gradient and then um work out the local gradient here
um and then sort of propagate that down to give us ds/dW so the end result is we want to sort of systematically
work to forward computation forward in the graph and backward computation back propagation backward in the graph in a way that we do things
efficiently so this is the general form of the algorithm which works for an arbitrary
computation graph so at the end of the day we've got a single scalar output z
and then we have inputs and parameters which compute z and so once we have this computation
graph and i added in this funky extra arrow here to make it a more general computation graph well we can always say
that we can work out a starting point something that doesn't depend on anything so in this case both of these bottom two nodes don't depend on
anything else so we can start with them and we can start to compute forward we can compute values for all of these sort of second row from the bottom nodes and then we're
able to compute um the third level up so we can have a topological sort of the nodes based on the dependencies in this
directed graph and we can compute the value of each node given some subset of its predecessors which it depends on and so
doing that is referred to as the forward propagation phase and gives us a computation of the scalar output z using our current parameters and our
current inputs and so then after that we run back propagation so for back propagation we initialize the output gradient
dz/dz as one and then we visit nodes in the reverse order of the topological sort
and we compute the gradients downward and so our recipe is that for each node as we head down we're going to
compute the gradient of the node with respect to its successors the things that it feeds into and how we compute that gradient is
using this chain rule that we've looked at so this is sort of the generalized form of the chain rule where we have multiple outputs and so we're summing
over the different outputs and then for each output we're computing the product of the upstream gradient and the local gradient with respect to that node and so we head
downwards and we continue down in the reverse topological sort order and we work out um the gradient with respect to each variable
in this graph and so it hopefully looks um kind of intuitive looking at this picture that
if you think of it like this the big-o complexity of forward propagation and backward propagation is the same right in both cases you're doing a linear pass
through all of these nodes and calculating values given predecessors and then values given successors i mean you have to do a little bit more work
um for working out the gradients sort of as shown by this chain rule but it's the same big-o complexity so if somehow you're implementing stuff for yourself rather than relying on the
software and you're calculating the gradients with a different order of complexity than forward propagation it means that you're doing something wrong you're doing repeated work that you
shouldn't have to do okay so this algorithm works for a completely arbitrary computation graph any directed acyclic
graph you can apply this algorithm in general what we find is that we build neural networks that have a regular layer structure so we have things like a
vector of inputs and then that's multiplied by a matrix it's transformed into another vector which might be multiplied by another matrix or summed with another matrix or something
right so once we're using that kind of regular layer structure we can then parallelize the computation by working out the
gradients in terms of jacobians of vectors and matrices and do things in parallel much more efficiently okay
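the general algorithm above can be sketched in a few lines of python — a toy Node class, made up here for illustration, that records its parents and local gradients and then runs backward in reverse topological order:

```python
# minimal sketch of reverse-mode autodiff on a computation graph;
# each node stores its parents and the local gradient for each parent
class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes this one depends on
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0                 # accumulated gradient of output w.r.t. self

def add(u, v):
    return Node(u.value + v.value, (u, v), (1.0, 1.0))

def mul(u, v):
    # multiply gate: local gradients swap the inputs
    return Node(u.value * v.value, (u, v), (v.value, u.value))

def backward(output):
    # topological sort of the graph rooted at the output
    order, seen = [], set()
    def topo(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                topo(p)
            order.append(n)
    topo(output)
    # visit nodes in reverse topological order, summing over outputs
    output.grad = 1.0
    for n in reversed(order):
        for p, local in zip(n.parents, n.local_grads):
            p.grad += n.grad * local    # downstream += upstream * local

x, y = Node(1.0), Node(2.0)
f = mul(add(x, y), y)                   # f = (x + y) * y
backward(f)
```

running it gives the summed gradient for y automatically, since y feeds both the add node and the mul node.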
so doing this is then referred to as automatic differentiation and so essentially um if you know the computation graph
you should be able to have your compute clever computer system work out um what the derivatives of everything is and
then apply back propagation um to work out how to update the parameters and learn and there's actually a sort of an interesting um
sort of thing of how history has gone backwards here which i'll just note um so some of you might be
um familiar with symbolic um computation packages so those are things like mathematica so mathematica you can
give it a symbolic form of a computation and then it can work out derivatives for you so it should be the case that if you give a complete symbolic form of a
computation graph um then it should be able to work out all the derivatives for you and you never have to work out a derivative by hand whatsoever
and that was actually attempted in a famous um deep learning library called theano which came out of yoshua bengio's group at the university of montreal
that it had a compiler that did that kind of symbolic manipulation um but you know somehow that sort of proved um
a little bit too hard a road to follow i imagine it actually might come back again in the future and so for
modern deep learning frameworks which includes both tensorflow and pytorch they do 90 percent
of um this computation of automatic differentiation for you but they don't actually symbolically compute derivatives so for each
particular node or layer of your deep learning system somebody either you or the person who wrote that
layer has hand-written the local derivatives but then everything from that point on the sort of the taking doing the chain rule of combining
upstream gradients with local gradients to work out downstream gradients that's then all being done automatically for back propagation on the computation graph
and so that what that means is for a whole neural network you have a computation graph and it's going to have a forward pass and a backward pass
and so for the forward pass you're topologically sorting the nodes based on their dependencies in the computation graph and then for each
node you're running forward the forward computation on that node and then for backward propagation you are reversing the topological sort of the
graph and then for each node in the graph you're running the backward propagation which is the little bit of backdrop the chain rule at that node and
then the result of doing that is you have gradients for your inputs and parameters and so
this is the overall software runs this for you and so what you want to do is then actually have stuff
for particular nodes or layers in the graph so if i have a multiply gate it's going to have a forward algorithm which just computes that the
output is x times y in terms of the two inputs and then i'm going to want to compute to tell it also how to calculate the local derivative so i want to say
what is the local derivative so dL/dx and dL/dy in terms of the upstream gradient dL/dz and so
i will then manually work out how to calculate that and normally what i have to do is i assume the forward pass is being run
first and i'm going to shove into some local variables for my class the values that were used in the forward computation so as well as computing z
equals x times y i'm going to sort of remember what x and y were so that then when i'm asked to compute
the backward pass i'm then going to have implemented here um what we saw earlier of um that when it's x y you're going to
sort of swap the y and the x um to work out the local gradients and so then i'm going to multiply those by the upstream gradient and i'm going to return
i've just written it here as a sort of a little list but really it's going to be a numpy vector of the gradients okay um
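that multiply gate might be sketched like this in python (the class and method names are made up for illustration, not from any particular framework):

```python
# multiply gate: forward caches its inputs, backward swaps them to form
# the local gradients and multiplies each by the upstream gradient
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y      # remember the inputs for the backward pass
        return x * y

    def backward(self, dL_dz):
        dL_dx = dL_dz * self.y     # local gradient of z = x*y w.r.t. x is y
        dL_dy = dL_dz * self.x     # and w.r.t. y is x -- the "swap"
        return [dL_dx, dL_dy]

gate = MultiplyGate()
z = gate.forward(3.0, 2.0)         # forward pass must run first
grads = gate.backward(1.0)         # then backward with upstream gradient 1
```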
so that's um 98 percent of what i wanted to cover um today just um a couple of quick comments um left so
um that can and should all be automated sometimes you want to just check if you're computing the right gradients and so the standard way of checking that you're
computing the right gradients is to manually work out the gradient by doing a numeric calculation of the gradient and so
um you can do that so you can work out what the derivative of f with respect to x should be by choosing some sort of small number
like 10 to the minus 4 adding it to x subtracting it from x and then so the difference between these numbers is 2h dividing it through by 2h
and you're simply working out the rise over the run which is the slope of that point with respect to x and that's an approximation of the gradient of f with
respect to x at that value of x so this is so simple you can't make a mistake implementing that and so therefore you can use this
to check whether your gradient values are correct or not um this isn't something that you'd want to use much um because not only is it
approximate but it's extremely slow because to work this out you have to run the forward computation for every parameter of the model so if you have a model with a million parameters you're
now doing a million times as much work as you would to run back prop using calculus so calculus is a good thing to
know but it can be really useful to check that the right values are being calculated in the old days when we hand wrote everything this was kind of the key unit test that
people used everywhere these days most of the time you're reusing layers that are built into pytorch or some other deep learning framework so it's much less
needed but sometimes you're implementing your own layer and you really do want to check that things are implemented correctly there's a fine point in the way this is
written if you saw this in sort of high school calculus class you would have seen rise over run of f
of x plus h minus f of x divided by h it turns out that doing this two-sided estimate like this is much much more
accurate than doing a one-sided estimate and so you're really much encouraged to use this approximation okay so at that point um we've mastered
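the two-sided estimate just described can be sketched in a couple of lines (a hypothetical helper, not library code):

```python
# two-sided ("central difference") numeric gradient check:
# (f(x + h) - f(x - h)) / (2h), the rise over the run on both sides
def numeric_grad(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 3               # analytic derivative is 3 x^2
approx = numeric_grad(f, 2.0)      # should be very close to 12
```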
the core technology of neural nets um backpropagation is recursively and hence efficiently applying the chain rule along the computation graph with this
sort of key step that the downstream gradient equals the upstream gradient times the local gradient and so for calculating with neural nets we do
the forward pass to work out values with current parameters then run back propagation and work out the gradient
of the loss and currently computed loss with respect to those parameters now to some extent um you know with modern deep learning frameworks you don't actually have to
know how to do any of this right it's like it's the same as you don't have to know how to implement a c compiler you can just write c code and say gcc and
it'll compile it and it'll run um the right stuff for you um and that's the kind of functionality you get from the pytorch framework so do come along to
the pytorch tutorial this friday and get a sense of how easy it is to write neural networks um using a framework like pytorch or tensorflow and you know
it's so easy that's why you know high school students across the nation are now doing their science projects training deep learning systems because you don't actually have to
understand very much you just plug a few neural network layers together and set it computing on some data but you know we hope in this class that you actually are also learning how these
things are implemented um so you have a deeper understanding than that and you know it turns out that sometimes you need to have a deeper
understanding so back propagation doesn't always work perfectly and so understanding what it's really doing can be crucial to debugging things
and so we'll actually see an example of that fairly soon when we start looking at recurrent models and some of the problems that they have which will require us to think a bit more deeply
about what's happening in our gradient computations okay that's it for today