The Complete Mathematics of Neural Networks and Deep Learning
By Adam Dhalla
Summary
Topics Covered
- Neural Networks Are Just Complex Calculus Problems
- Gradients vs. Jacobians: Mapping Vectors to Scalars vs. Vectors
- The Neuron: A Vector-to-Scalar Function
- Gradient Descent: Navigating the Cost Landscape
- Backpropagation: Efficiently Calculating Weight Gradients
Full Transcript
Hi there, I'm Adam Dhalla. Over the course of this video, I'm going to be explaining the mathematics behind neural networks, specifically artificial neural networks, the standard feedforward kind, with a particular focus on backpropagation and how the calculus behind it works. I just want to cover a few big-picture, introductory, prerequisite things first, and then we'll jump right into it.

All right, let's talk about the prerequisites for this lecture; this is important. First, something I really want to stress: this is not an introduction. If you're looking for an introduction to neural networks, how they work, and a high-level explanation, there are plenty of amazing resources for that sort of thing; I'd recommend Andrew Ng's course, and there are hundreds of videos out there (3Blue1Brown has a great video series). This is for people who want to know a little more about the mathematics happening beneath the surface.

Second, all that's needed is basic linear algebra. Nothing too complex, no matrix factorizations or eigenvector/eigenvalue material: just transposing matrices, multiplying and adding matrices, and dot products of vectors.

Multivariable calculus is probably the majority of what this is going to be; it's the basis for all of backpropagation. You should be comfortable with single-variable calculus and familiar in some sense with multivariable calculus, specifically differential calculus (there won't be any integral calculus in this video). You should at least know what Jacobians and gradients are and how they work for differentiable multivariable functions, and ideally you've derived a few yourself.

And then basic ML knowledge (I'm trying to keep this short, because I've already done a take of this and I talk way too much): what a cost function is, gradient descent in general, just knowing how machines learn. Those are the prerequisites. If you're still here, we're going to have a lot of fun, so let's jump right into the agenda. All right, this is the agenda for what we're going to cover in this lecture.
It's obviously a very abbreviated version, and we'll cover much more than this, but it gives you a good idea of what's coming. First, we start with the big picture of neural networks and backpropagation: this isn't an introductory course, but we'll talk about the big goals here and what we're trying to do with backpropagation, so that when we get into the nitty-gritty we always have an idea of what we're trying to do. Next, a quick review of multivariable calculus and differential calculus, talking about Jacobians and gradients and how they come about from differentiating functions of multiple variables (a note: I'll use "matrix calculus" and "multivariable calculus" more or less interchangeably in this lecture). Next, I'll explore the idea of seeing a single neuron in a neural network as a function that takes in a vector and outputs a scalar; that's an important idea. Then we'll talk about Jacobians and how they work out in neural networks, looking at a very simple network and seeing how that works. Next, we'll see how the gradients calculated there help us optimize using gradient descent; I'll talk briefly about stochastic gradient descent and a few other things, but we won't spend too much time there. And then the big monster: backpropagation, first just with scalars and vectors, keeping it at the level of notation, and then jumping into what the actual derivations, matrices, and Jacobians look like when we're dealing with full-scale backpropagation. One thing I'll put here so I don't forget: all sources I talk about in this video will be linked in the description; there are a few important ones I'll mention, but in general they come from lots of different places. So let's jump right into it, first with the big picture and some notation.
All right, some quick stuff to get out of the way: the notation we're going to use. Most of it is pretty standard, so if you're in machine learning already you probably know most of this. There will be more notation coming later, but I'll explain it when we get to it; this is the important preliminary stuff.

- m is the size of our training set, the number of training examples we have.
- n is the number of input variables; each training example has n input variables, often written x1, x2, ..., xn, where each of these is one of our input variables.
- Capital L is the number of layers in our neural network.
- Lowercase l refers to a specific layer, and often appears as a superscript or subscript on another term. For example, w^l is the weights of layer l, where l is the layer the weights are going into: if these weights sit between layer 1 and layer 2, they get the superscript 2, because they're going into layer 2. Similarly, b^l is the bias for layer l, so a bias in the second layer is b^2.
- x_i denotes a single input variable (the subscript i says which variable it is); again, each training example has one of each input variable.

That's all we need to know for notation, so we'll continue on. All right, let's get into the real stuff.
This is the big picture: a few things we want to keep in mind when we're going through the nitty-gritty math.

The first thing to think about is neural networks as functions. A lot of people picture them as diagrams, but it's important to keep in mind that a neural network is just one big, fancy function made up of a lot of little functions. The parameters of this function are the weights and biases, however many there may be, and the input is usually a vector of all the variables of one training example: some x1, x2, x3, x4, ..., xn going into the first layer. The output can be a lot of different things depending on the network: for something like image classification it might be a vector of probabilities, from which you choose the biggest probability as your classification. Often the output is a scalar, or a decision, a yes or no, something like that. The network is basically turning raw numbers into something understandable.

Secondly, we want to think of this as just one big calculus problem, because that's really all it is: we're trying to minimize the cost function. Whatever intimidating terms you run into, it's all just calculus, one big minimization problem: you take the cost and try to minimize it, through a very complicated, huge chain-rule operation with respect to however many parameters you have. More specifically, it's all about finding the derivative of the cost function with respect to every single weight and bias. Think about it like this: if you have a function with, say, a thousand parameters, a thousand weights and biases, the best way to figure out how to lower that cost function is to find out how much each weight and bias contributes to the cost, so that you can appropriately add to or subtract from the different weights. That's why we do all this: the entire goal of backpropagation is to find the derivative of the cost with respect to every different weight and bias, i.e., how the cost changes when we change a weight or a bias in our algorithm. This lets us tweak them, learning how each one impacts the final cost.
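To preview where this is headed, here's a minimal sketch in Python (entirely my own toy example; a two-parameter quadratic stands in for a real network's cost) of how knowing the derivative of the cost with respect to every parameter lets us lower the cost step by step:

```python
import numpy as np

def cost(w):
    # Toy stand-in for a network's cost: minimized at w = (1, -2)
    return (w[0] - 1.0)**2 + (w[1] + 2.0)**2

def grad_cost(w):
    # Derivative of the cost with respect to every parameter;
    # for a real network, computing this is exactly backpropagation's job
    return np.array([2 * (w[0] - 1.0), 2 * (w[1] + 2.0)])

w = np.array([5.0, 5.0])        # start with arbitrary parameters
for _ in range(200):
    w = w - 0.1 * grad_cost(w)  # nudge each parameter against its gradient
print(w, cost(w))               # w ends up near [1, -2], cost near 0
```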
Lastly, just keep in mind how exciting this all is. Deep learning has become so much bigger in just the past ten years, and although this material was actually invented in the 1960s and 70s, with backpropagation as we know it introduced in 1986, this older math is being applied in very novel ways, and it's not a super big stretch from this to the cutting edge. So it's quite interesting, and I hope you'll be enthusiastic about all the math we're learning. After this, we'll hop right into the real math; see you there.
All right, let's jump right in with some matrix calculus, starting with gradients, which are how we display the partial derivatives of a function that takes a vector to a scalar. Let's see what I mean by that with an example: say f(x, y) = x^2 + cos(y), a function with two input variables. Let's take the partial derivatives with respect to both variables: with respect to x it's 2x, and with respect to y it's -sin(y). Those are our two partial derivatives; that wasn't too tough.

Now let's explore what I meant by taking a vector to a scalar. This function might not look like it takes a vector, but we can display its input as one: the vector (x, y) is our input, we put it through f(x, y), which you can see as a vector function, and it returns some scalar. To make it concrete, let's put some numbers in: say x = 2 and y = 0. Plugging those in, 2^2 is 4, cos(0) is 1, and adding them gives 5. So we took a vector in R^2 to a scalar in R^1. That's the kind of function we're dealing with here.

A gradient is how we display the partial derivatives of this function, or any function like it that takes a vector to a scalar. It might sound complicated, but all it is is a vector of the partial derivatives: the partial derivative with respect to x, then the partial derivative with respect to y. If you had a thousand input variables, there would be a thousand entries in this vector, one per input variable; here we just have two. So in our case the gradient is (2x, -sin(y)). This is described as the gradient of f(x, y).

Some notation for gradients that we'll be using, which is pretty common: a gradient is written with the upside-down triangle symbol ∇ (properly called a nabla, but we don't have time for that elitism), followed by the name of the function: ∇f, or ∇f(x, y). So whenever you see this symbol, think of a vector of partial derivatives.
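To make this concrete, here's a quick sketch in Python with NumPy (my own illustration; the helper names are made up) comparing the analytic gradient (2x, -sin(y)) against a finite-difference approximation at the point (2, 0):

```python
import numpy as np

def f(v):
    # f(x, y) = x^2 + cos(y), written as a vector-to-scalar function
    x, y = v
    return x**2 + np.cos(y)

def grad_f(v):
    # Analytic gradient: (df/dx, df/dy) = (2x, -sin(y))
    x, y = v
    return np.array([2 * x, -np.sin(y)])

def numeric_grad(f, v, h=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(v)
    for i in range(len(v)):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

v = np.array([2.0, 0.0])
print(f(v))                # 5.0, as computed above
print(grad_f(v))           # [ 4. -0.]
print(numeric_grad(f, v))  # matches to ~1e-6
```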
All right, next up, let's talk about Jacobians, which are basically how we display the partial derivatives of a function that takes a vector to another vector. Let's do an example again: say f(x, y) also takes in two variables, but this time it's a vector function that returns a vector. Let's make it something interesting:

f(x, y) = (2x - y^3, e^x - 13y)

So now we have an R^2 vector going to an R^2 vector. (These sizes can differ; you could map to an R^3 vector instead, but we'll go with R^2.)

So how do we take the partial derivatives of this? Good question. Our natural instinct is to split this up into two scalar functions: f1 = 2x - y^3, the first row, and f2 = e^x - 13y, the second row. Now we can take the derivatives of both of these functions with respect to x and y.
You'll see what I mean. Let's take the partial derivative of f1 with respect to x first: that's 2. Then the partial derivative of f1, but this time with respect to y: that's -3y^2, using the power rule. Now we have all the partial derivatives of our first function, so let's worry about our second function, f2. We do the exact same thing: with respect to x we get e^x, and with respect to y we get -13.

So those are our partial derivatives, all four of them, two for each function. The Jacobian is how we display these partial derivatives, not dissimilar to how the gradient worked. Remember how the gradient was a vertical vector, with the derivative with respect to x and then the derivative with respect to y? Now that we have four, we're going to have a matrix, and that's called the Jacobian.

Before drawing it, let's think about what to expect. All the derivatives of f1 go in the first row, and all the derivatives of f2 go in the second row; obviously, if your output vector has more rows, you have more output equations and this keeps going. The rows are the functions, and the columns are the variables. You can think of the rows as the gradient of f1 and then the gradient of f2 (to be exact, the transpose of each gradient, since you're taking a column vector and turning it into a row vector). Writing the entries as partial derivatives and then filling them in for our case:

∂f/∂(x, y) = [ ∂f1/∂x  ∂f1/∂y ]  =  [ 2    -3y^2 ]
             [ ∂f2/∂x  ∂f2/∂y ]     [ e^x  -13   ]

So it's a 2-by-2 Jacobian: if you had more variables it would be wider, and if you had more functions it would be taller. That's our Jacobian, and I think that's all you need to know about it for now. One good thing to note, and I'll explain why and when this happens later: a lot of the time these end up being diagonal matrices, with zeros everywhere off the diagonal, which makes computation a lot quicker.
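Here's a small sketch (NumPy again; the finite-difference check is my own illustration) that verifies this Jacobian numerically:

```python
import numpy as np

def f(v):
    # f(x, y) = (2x - y^3, e^x - 13y), a vector-to-vector function
    x, y = v
    return np.array([2 * x - y**3, np.exp(x) - 13 * y])

def jacobian_f(v):
    # Analytic Jacobian: rows are the functions, columns are the variables
    x, y = v
    return np.array([[2.0,       -3 * y**2],
                     [np.exp(x), -13.0]])

def numeric_jacobian(f, v, h=1e-6):
    # Column j holds the finite-difference derivatives with respect to v[j]
    cols = []
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        cols.append((f(v + e) - f(v - e)) / (2 * h))
    return np.stack(cols, axis=1)

v = np.array([1.0, 2.0])
print(jacobian_f(v))
print(numeric_jacobian(f, v))  # the two agree to ~1e-6
```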
Next, let's move on to the Jacobian chain rule. You've already seen the scalar chain rule, I hope, and you're familiar with how to differentiate a function using it, but I want to introduce a new way of looking at it that will help us differentiate larger functions and get comfortable with Jacobians and that sort of thing.

The scalar chain rule: if we have some function we can apply the chain rule to, like sin(x^2), normally we take the derivative of the inner function times the derivative of the outer function, and we get 2x·cos(x^2). But there's a slower (with small functions) yet methodical, algorithmic way that we can apply to any function, and that's splitting it up into two functions. We set this inner function x^2 equal to g, and we set the outer function to f = sin(g); if we plug g back in, it's the exact same thing. What we're trying to find is the derivative of f with respect to x, i.e., how f changes when we change x. Usually we do this the quick way, but let's do it methodically: we find how f changes when we change g, and then how g changes when we change x. That's really what we're doing with the chain rule: breaking the function into two functions and finding the derivatives of both. It's just an interesting way to look at it.

Let's carry it through. How g changes with respect to x is 2x, so dg/dx = 2x. How f changes with respect to g is cos(g). So the change in f with respect to the change in x is cos(g) · 2x. The last step is to substitute: we replace any intermediate functions we created with whatever they were equal to, so it becomes cos(x^2) · 2x. That's the exact same answer we get by doing it the quick way, now obtained via this intermediate-function way of seeing things.
So when we have a very nested function, something like g(f(c(a(h(x))))), what we can do is set all of the inner functions to intermediate function variables, find the derivatives of all of them, multiply, and then substitute at the end. That's a more algorithmic view of the scalar chain rule (you might already be familiar with it) that works for a lot of different functions: assign each inner function a variable, find all the derivatives, multiply, and substitute. This will help us when we look at the Jacobian chain rule next.
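A minimal sketch of this intermediate-variable view for sin(x^2) (my own illustration):

```python
import numpy as np

def chain_derivative(x):
    # Inner function g(x) = x^2 and its local derivative dg/dx
    g = x**2
    dg_dx = 2 * x
    # Outer function f(g) = sin(g) and its local derivative df/dg
    df_dg = np.cos(g)
    # Chain rule: df/dx = df/dg * dg/dx. The "substitute at the end" step
    # happens automatically, because cos is evaluated at the value g = x^2.
    return df_dg * dg_dx

x = 1.5
print(chain_derivative(x))      # the methodical way
print(2 * x * np.cos(x**2))     # the quick way: same answer
```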
All right, so how can this idea help us deal with vector-to-vector functions? Look at this example:

f(x, y) = (sin(x^2 + y), ln(y^3))

Both components are functions that would generally require the chain rule if you were dealing with them by themselves. For the first, you would compute the derivative of the outside, cos(x^2 + y), times the derivative of the inside. So we do something similar to our earlier example: we set an intermediate variable for each of these insides, turning this into a vector of intermediate functions:

g = (g1, g2) = (x^2 + y, y^3)

Those are the two intermediate functions. Now we set an outer function vector:

f = (sin(g1), ln(g2))

That makes sense because, similar to the f1 and f2 from before, if we substitute g1 and g2 into here we get back this exact same vector. Great, so what do we do from here? We compute the Jacobians of both of these.

First, the Jacobian of g (let's get used to this notation): how g changes when we change x and y, or to be specific, how g1 and g2 change when we change x and y. We can represent this as ∂g/∂x, the change in g, which is now a vector of g1 and g2, when we change x, which is now a vector of x and y. That notation is important. Take the first function, g1, and see how it changes with x: that's 2x. How it changes with y: 1. How g2 changes with x: 0, because there's no x in that row. And how g2 changes with y: 3y^2.

∂g/∂x = [ 2x  1    ]
        [ 0   3y^2 ]

Now the exact same thing for the second vector: the change in f when we change g. This is another Jacobian, but instead of our variables being x and y, they're now g1 and g2, because those are the inputs to this vector. So: sin(g1) with respect to g1 is cos(g1); sin(g1) with respect to g2 is 0; ln(g2) with respect to g1 is 0; and ln(g2) with respect to g2 is 1/g2.

∂f/∂g = [ cos(g1)  0    ]
        [ 0        1/g2 ]
Now that we have these two Jacobians, what can we do with them? You might remember that with the scalar chain rule, we could find the derivative of some y with respect to some x if we had the derivative of that y with respect to some intermediate u, and the derivative of that u with respect to x: you can almost think of the du terms cancelling out, leaving dy/dx. Keeping that in mind, we can do the exact same thing for the vector chain rule with Jacobians, and now that we're familiar with this notation, we can see how it comes about pretty easily. We're trying to find the change in the f vector when we change the x vector, and that's the product of the two Jacobians we just computed.

With matrix multiplication it's really important to get the order correct: for matrices, AB is not in general equal to BA, so we need to be careful about ordering. When multiplying Jacobians for the Jacobian chain rule, we start with the outer function first:

∂f/∂x = (∂f/∂g)(∂g/∂x)

(Don't forget the partial signs, and watch the order: on my first attempt I did exactly what you're not supposed to do and switched them around.) The first Jacobian here is the one we computed for f, and the second is the one for g, so we can substitute them directly:

∂f/∂x = [ cos(g1)  0    ] [ 2x  1    ]
        [ 0        1/g2 ] [ 0   3y^2 ]

Now we just do a simple matrix multiplication. First row times first column: 2x·cos(g1). First row times second column: cos(g1). Second row times first column: 0. Second row times second column: 3y^2/g2.

∂f/∂x = [ 2x·cos(g1)  cos(g1)   ]
        [ 0           3y^2 / g2 ]

I like to do a mental check here, and I'm always a bit suspicious of zeros, so let's make sure that zero makes sense. That column is the change with respect to x, so why doesn't changing x affect the second function? Well, just looking at it, there's no x in ln(y^3); it's completely unrelated to x, so of course its derivative with respect to x is zero.
The last step is substituting the intermediate functions back in for g1 and g2. Doing that, our change in f when we change x is exactly:

∂f/∂x = [ 2x·cos(x^2 + y)  cos(x^2 + y) ]  =  [ 2x·cos(x^2 + y)  cos(x^2 + y) ]
        [ 0                3y^2 / y^3   ]     [ 0                3/y          ]

(3y^2/y^3 simplifies to 3/y.) So this Jacobian is the change in our f when we change our input vector x: our final answer, obtained using the chain rule with Jacobians.
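As a sanity check, here's a self-contained sketch (my own illustration) comparing this final Jacobian against finite differences:

```python
import numpy as np

def f(v):
    # f(x, y) = (sin(x^2 + y), ln(y^3))
    x, y = v
    return np.array([np.sin(x**2 + y), np.log(y**3)])

def chain_rule_jacobian(v):
    x, y = v
    g1 = x**2 + y                            # intermediate function g1
    g2 = y**3                                # intermediate function g2
    df_dg = np.array([[np.cos(g1), 0.0],     # outer Jacobian df/dg
                      [0.0,        1.0 / g2]])
    dg_dx = np.array([[2 * x, 1.0],          # inner Jacobian dg/dx
                      [0.0,   3 * y**2]])
    return df_dg @ dg_dx                     # outer first, then inner

def numeric_jacobian(f, v, h=1e-6):
    cols = []
    for j in range(len(v)):
        e = np.zeros_like(v)
        e[j] = h
        cols.append((f(v + e) - f(v - e)) / (2 * h))
    return np.stack(cols, axis=1)

v = np.array([0.7, 2.0])       # y > 0 so that ln(y^3) is defined
print(chain_rule_jacobian(v))
print(numeric_jacobian(f, v))  # the two agree to ~1e-6
```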
There's just one thing I want to mention that I was confused about when I first learned this, and that I had to figure out myself because there weren't many resources on it: what happens when not all of these functions are "chain-rule-able"? Say our vector had another component, like xy^3, where you wouldn't use the chain rule. What would the intermediate vector look like? You'd still have your normal g1 = x^2 + y and g2 = y^3, just like before, but there's no natural inner function for this new component. The trick is to make the intermediate function the whole expression (here, g3 = xy^3) and let the outer function be the identity, f3 = g3, whose derivative is just 1. Leaving a one there makes all the multiplication afterwards work out safely. That's something to keep in mind when you're multiplying out these Jacobians, because it happens a lot: you won't always have components that use the chain rule.

That's really all you need to know about the Jacobian chain rule, and basically the extent of the calculus needed here, although I'll explain the derivatives and Jacobians of some special operations, like the Hadamard product (the element-wise product of two vectors), in the coming parts. That's all you need for preliminary knowledge.
All right, I'll see you in the next part. So let's jump straight in with neural networks and neurons; this is exciting, our first actual neural network material. First of all, I'm going to describe how we can mathematically explore the concept of a single neuron in a neural network. Let's jump in with an example: x1, x2, x3, x4, x5 are the inputs to a single neuron. Each of these inputs is connected to the neuron by some weight: w1, w2, w3, w4, w5, you get the pattern.

What our neuron does is take the weighted sum of all these inputs: it multiplies each input by its corresponding weight, so x1·w1 + x2·w2 + x3·w3, and so on. To describe this mathematically, it's the summation from i = 1 up to n (remember, n is the number of input variables we have) of x_i·w_i, and then we add our bias:

sum over i = 1, ..., n of x_i·w_i, plus b

That's basically what I said: we sum over all of our inputs, multiplying the first input by the first weight, adding the second input times the second weight, then the third input times the third weight, and you get the point. We end up with a single scalar; that's the most important idea here, this sum evaluates to a scalar. Then we have the bias, which is just a scalar term attached to this node that we add on; call it b. These scalars obviously add together into just one scalar.

Then there's the activation function. I'll use the sigmoid symbol σ, but the sigmoid function itself isn't super popular any more; these are generically called activation functions, and there are multiple options. The sigmoid is a little older and less popular now; most people use the (such a cumbersome name) rectified linear unit, or ReLU, which basically makes all negative values zero and leaves positive values alone. There are lots of different functions you can use, so let's just say we're applying some activation function; I'll talk more about that later.
So we've got that down, and we can actually do something more: we can describe this in a slightly more understandable way using dot products, since that summation is a bit of a pain to look at. Just to review what the dot product of two vectors is: say we have the vector (3, 2, 1) and another vector (4, 5, 6). The dot product basically means transposing one of these and multiplying, and the important part of a dot product is that we multiply two vectors and end up with a scalar (it's sometimes called the inner product). So transpose one of them (it doesn't really matter which) and multiply it by the other: element by element, 3 times 4 is 12, plus 2 times 5 is 10, giving 22, plus 1 times 6 is 6. The scalar output of the dot product is 28.
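In code, as a quick check (NumPy again):

```python
import numpy as np

a = np.array([3, 2, 1])
b = np.array([4, 5, 6])
print(a @ b)         # 3*4 + 2*5 + 1*6 = 28
print(np.dot(a, b))  # the same inner product
```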
That was a super quick review of what a dot product is, and it might seem relevant to this discussion about the sum, because we can describe the weighted sum as a dot product instead. We have two vectors: the inputs x = (x1, ..., xn) and the weights w = (w1, ..., wn); for now, n is five. We take the inner product of these, which is the exact same thing: multiplying the first input by the first weight, then adding the second input times the second weight, and so on, giving x1·w1 + x2·w2 + x3·w3 + ... + xn·wn. That's exactly what the summation was doing, so we can replace it with the simple dot-product notation: transpose one of the vectors and multiply it by the other. It really doesn't matter which order, so I'll write x transpose w, i.e., x^T w. That's still a scalar, and then we add the bias:

x^T w + b

Simple enough: that's the simplified notation for a single neuron using dot products.
That's nice, but I want to hit on one more key, central idea. This inner part, x^T w + b, we're going to call z. So we can represent this neuron in two steps: first the weighted-sum part, z = x^T w + b, and second, putting z into our sigmoid, or whatever activation function we choose. The activation of this neuron, which we call a (we'll add more superscripts and subscripts once we have more and more neurons, but let's keep it simple for now), is just the activation function applied to z: a = σ(z). The neuron is often represented in these two steps, for reasons we'll see later.

The key idea I want to get across here is that this single neuron is basically a function that takes in a vector and outputs a scalar: a vector-to-scalar function. That might make you think "gradients", and that's exactly where your mind should be going. A vector goes in, and you get a scalar out. That's all it is; that's all a single node is.
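Here's a minimal sketch of a single neuron as a vector-to-scalar function (the function names and example numbers are my own, for illustration):

```python
import numpy as np

def sigmoid(z):
    # Classic sigmoid activation: squashes any scalar into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity otherwise
    return np.maximum(0.0, z)

def neuron(x, w, b, activation=sigmoid):
    # Step 1: weighted sum, z = x^T w + b (a scalar)
    z = x @ w + b
    # Step 2: activation, a = sigma(z) (still a scalar)
    return activation(z)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # five inputs
w = np.array([0.1, -0.2, 0.3, 0.0, 0.5])  # one weight per input
b = 0.5                                   # the bias
print(neuron(x, w, b))        # vector in, scalar out
print(neuron(x, w, b, relu))  # the same neuron with ReLU instead
```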
That's what I wanted to get across. Next up, we'll look at layers of neurons and how they can be represented with entire matrices; that's very important, so stick around.

All right, one thing I just want to get out of the way is the notation for the weights of a neural network, which can be a bit tricky at first. Say we have inputs x1 and x2, then two neurons, then maybe another layer of neurons, and then the output. Before, I just said w1, w2, w3, and so on, but now that we have multiple nodes that breaks down: you can't just say w1, because then which one is w2, and what are the weights going to the other node called? It becomes a headache. So what we do is write each weight in this notation:

w^l_jk

This is the standard notation for any single weight in our network, where:
- l is the layer the weight is going into (I said this before, but it's worth reiterating);
- j is the number of the neuron in layer l that the weight is going into;
- k is the number of the neuron in layer l - 1 that it's coming from.

There's a reason for this notation, even though it might seem a little weird at first. Let's take an example to make it more concrete: say we want to establish what a particular weight is.
Say this is layer 1, this is layer 2, and this is layer 3, and let's represent a weight going into layer 2: the superscript is 2. It's going into the second node of layer 2, so we make j = 2. Then for k we take the node in the layer before: which node of layer l - 1 is it coming from? Again the second one, so k = 2. (It might be a confusing example because the numbers repeat: this is a weight that comes from the second node of layer 1 and goes into the second node of layer 2, so it's w^2_22.)

Let's do just one more example. Take a weight going into layer 3, so we write w^3. The k is the node in the layer before (it's a little confusing that k comes after j in the notation even though it's where the weight comes from): it comes from the first node of layer 2, so k = 1. And j is the node it goes into: again the first one, so j = 1, giving w^3_11. One last lightning-fast example: a weight into layer 3, coming from node 2 of the layer before and going into node 1, would be w^3_12.

That's the last example I'll do; there are a lot of resources that explore this sort of thing, and it's really important to get it down. For practice, I'll put up a couple of weights here, and you can think about which notation they correspond to; I'll let you do those. (The sketch below shows how this notation lines up with matrix indexing in code.)
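Since the j-then-k ordering is exactly how matrix indices work, here's a tiny sketch (a hypothetical weight matrix of my own; the values are arbitrary) of how w^l_jk maps onto code:

```python
import numpy as np

# Hypothetical weight matrix for layer l = 2 of some small network:
# rows are j (the neuron the weight goes INTO, in layer l),
# columns are k (the neuron it comes FROM, in layer l - 1).
W2 = np.array([[0.5, 0.2],
               [0.3, 0.6]])

def weight(W, j, k):
    # w^l_jk, with the 1-based j and k of the lecture's notation
    return W[j - 1, k - 1]

print(weight(W2, 2, 2))  # w^2_22: from node 2 of layer 1 into node 2 of layer 2
print(weight(W2, 1, 2))  # w^2_12: from node 2 of layer 1 into node 1 of layer 2
```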
Okay, now we want to dive into the biases (these are a lot easier, don't worry). Generally the bias is the same for each node in a layer: maybe the bias is 5 for this layer and 6 for that layer (maybe it's -73; I doubt it, but who knows). It's one scalar per layer. The bias for a layer is signified by b^l, so if the bias here were 5, then b^1 = 5. Sometimes you'll also see b^l_j, with a subscript: that means that, in that scenario, different nodes have their own specific biases. This is used sometimes, sometimes not, so take what you will. For example, if this bias had specifically a 5 attached to it, and maybe this one had a 3, then b^1_2 = 5 and b^1_1 = 3. Mostly I'll be using just b^l, but that's good to keep in mind.

I think that's about it for notation; if anything pops up later, I'll be sure to mention it. Weight notation and bias notation can be a bit tricky, but you'll get used to them. So let's dive in with some matrices and see how this actually operates; see you on the other side. Okay, now that we have the notation down, we can jump into how we represent the computations for an entire layer of a neural network.
Remember how before we had one neuron with five inputs, and we explained that the output of this neuron is a vector-to-scalar function? Now think of larger networks of neurons: say we have three neurons here, and inputs x1, x2, x3, x4, x5. Each of these inputs goes to each node, with its own corresponding weight. So each of these nodes is receiving a weighted sum of the inputs and its weights: this node here has five inputs, multiplied by five weights that are specific to this neuron.

Since a single node takes in a vector of inputs, multiplies it by a vector of weights, and puts out a single scalar, each of these three nodes gives a scalar: say s1, s2, s3, each node multiplying its own weights by the same inputs. These scalars form the new vector of inputs, just like the x's, for the next layer of our neural network: we send each of them on to some neuron and the same thing happens again. The vector of scalars output by a layer is what we like to call a^l, the activations of layer l; so if this is layer 1, this vector is a^1.

So now that we have this intuition about what's happening with multiple nodes, how can we represent it? Remember that we represent a single node as x^T w + b, where x is a vector, w is a vector, and b is a scalar (the bias), and then we put that through some sort of activation. Well, now we can represent the weights of an entire layer with a weights matrix. This is written as a capital W (we usually use capital letters to signify matrices): W^l, which takes care of all the weights going from each input to each node in the layer.
How do we do that? One thing to understand about this weights matrix: recall our old notation for locating one weight in the entire network, w^l_jk, remembering that j is the number of the neuron in the layer it's going to, and k is the number of the neuron in the layer it came from (I won't review that again, but go back and check it if you haven't got it down yet). In the weights matrix, each entry goes like this: the rows are the j and the columns are the k. So the entry in the jth row and kth column comes from the kth neuron and goes to the jth neuron, and the whole matrix belongs to layer l. For example, if k = 1 and j = 1, that's the position of the (1, 1) entry: the weight coming from the first neuron in layer l - 1 and going to the first neuron in layer l.

That might be a little hard to follow in the abstract, so it'll be easier with an example. Let's make a super simple little neural network and actually compute the weights matrix by hand, putting real values on the weights, just to get a better idea of what's going on.
All right, we've got a super simple neural network here: our two inputs go in, there's just one layer of two nodes, they process something, and it goes on from there. How do we describe this first layer of weights, W^1? Let's give these weights values: say the weight going from x1 to node 1 is 5, the weight from x2 to node 2 is 6, the weight going from x2 to node 1 of the first layer is 2, and the weight from x1 to node 2 is 3. So, to recap: x1 to node 1 is 5, x2 to node 1 is 2, x1 to node 2 is 3, and x2 to node 2 is 6.

Now let's see if we can construct the weights matrix for this layer: W^1, because it's the first set of weights, and it's going to be 2 by 2. Remember that the columns are the k, which is where the weight is coming from: if it's coming from the first input it's in the first column, and if it's coming from the second input it's in the second column. And j is the rows: going into the first node means the first row, and going into the second node means the second row. That's how we locate each entry by its position.

So what goes in the (1, 1) position? That's going from k = 1 to j = 1, and that weight is 5. Next, the (1, 2) position (we always index matrices row then column, but you should probably know that by now): this still goes to node 1 but has k = 2, so it comes from the second input and goes to the first neuron, which is 2. (The reason it's in the second column is that the columns are k and it comes from the second input; it's still in the first row because the rows are j and it still goes to the first node.) Now the (2, 1) position: k = 1, so it comes from the first input, and the row is 2, so it goes to the second node; that's 3. And then from the second input to the second node of this layer: that's 6. So:

W^1 = [ 5  2 ]
      [ 3  6 ]

Hopefully you've got that down; that's the weights matrix.
Now, remember the x^T w + b notation for the function of a single neuron? We can represent the input to an entire layer, and the output of an entire layer, as the following:

W^l a^(l-1) + b^l

I'll explain this over the next couple of minutes by actually playing it out and seeing why it works. This a^(l-1), by the way, is the activations of the previous layer. Remember how we talked about a whole layer of neurons, whose scalar outputs become a vector? That vector is an a vector. For the first layer, our a is just the x inputs, but for the second layer, the a will be the outputs of the first layer, and so on. That's why a lot of the time we represent the x input as a^0 or something like that: you can almost think of it as the zeroth activation. So we have this calculation; why does it work?
Let's take some unique numbers as our inputs, say 10 and 20 (these numbers might get a little big, but let's see what happens). The formula tells us to multiply the weights matrix by the activations of the previous layer, i.e., multiply these weights by these inputs:

W^1 a^0 = [ 5  2 ] [ 10 ]
          [ 3  6 ] [ 20 ]

To do this matrix multiplication: this is a 2-by-2 matrix times a 2-by-1 matrix, so our output is going to be 2 by 1, a vector. That matches what we're expecting, because the output of this layer should be a vector of two numbers, the activation of each node: two scalar outputs that become a 2-by-1 vector. So it's making sense so far.

Now let's do the multiplication, but think about what we're doing here. We're multiplying the first row, the weights 5 and 2, by the two inputs. What does that mean in terms of the actual weights? The 5 is the one connecting x1 to node 1, and the 2 is the one connecting x2 to node 1. So multiplying this row by the inputs calculates exactly the weighted sum going into this neuron, because this neuron takes in the first input times its weight plus the second input times its weight. When we multiply the row by the input vector and sum, we're doing the exact same thing we did with a single neuron, where we took the dot product: it's the dot product of the inputs x1 and x2 with the weights going to node 1. The scalar output is 5 times 10, which is 50, plus 2 times 20, which is 40, giving 90. That's a big output; and so far, because we haven't added the bias yet, the x^T w part for this node is 90, the weighted sum of the inputs and the weights going into this neuron.

Now the exact same thing for the second node: the second row takes the weighted sum of all the weights going into that node, which are 3 and 6. So we multiply 3 times 10 (I'm regretting choosing such big numbers) to get 30, plus 6 times 20, which is 120, for 150. It's the exact same calculation: we take the inputs and multiply by the weights going into this neuron, exactly the dot product we did with a single neuron. So by using this matrix multiplication, we do the exact same operation for all the neurons at the same time, which is a lot easier.
pretty easy to understand it's not that new but let's just kind of you know keep going with the example so usually a bias is the same for the entire layer of neurons so
let's say the bias for the entire layer is plus six it can be a negative number we'll say it's plus six so we add six to each of these uh results
so we get 96 and 156.
so so so now we've calculated this entire inside which we often call z so we can say that z 1 equals this vector of 96 and 156
but we have one more step: the activation function. i'm going to walk through all of this with you so you can better understand what's going on — obviously i won't do this by hand for larger networks, but it gives you a good idea of what's happening. so we have this output vector z1, and now we just put z1 — the vector (96, 156) — into an activation function. this can be any nonlinear function you want. a popular one that's also easy to compute is relu, which says the output equals z unless z is negative, in which case it's clipped to zero. both of our values are non-negative, so putting them through relu leaves them unchanged — as opposed to the sigmoid, which squashes everything toward 0 or 1; if we were using the sigmoid these would both come out close to 1. but let's use relu, which tends to work better in practice. so (96, 156) after the activation function is still (96, 156), and that's the output vector of this layer, so now we can say that a1 equals (96, 156).
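just to make that arithmetic concrete, here's a minimal numpy sketch of this exact layer computation — the names W1, x, b1 are my own illustrative choices, but the numbers are the ones from the example:

    import numpy as np

    W1 = np.array([[5.0, 2.0],    # row 1: weights into neuron 1
                   [3.0, 6.0]])   # row 2: weights into neuron 2
    x  = np.array([10.0, 20.0])   # activations of the previous layer
    b1 = np.array([6.0, 6.0])     # one bias entry per neuron

    z1 = W1 @ x + b1              # weighted sums plus bias: [96. 156.]
    a1 = np.maximum(0.0, z1)      # relu leaves non-negative values alone
    print(z1, a1)                 # [ 96. 156.] [ 96. 156.]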
then we can think about feeding that into the next layer. remember how before we had x as the input to our function — let's keep that formula up here, it's important. in the next layer, where we previously had x, we now plug in a1 together with a new weights matrix, and you can see how this works and how it would scale to a much bigger network: you just keep going through the network doing the same thing.
just one more thing, on how i represent neural networks. of course the visual way is to draw some sort of graph, but i find a quick and easy way — and one that's really helpful when you're trying to derive things — is to write the layers as a chain of these calculations. so, an example: let's call our inputs a0. then the whole network is just a string of calculations of the same thing. we can write w1 times a0 plus b1 — and although it may seem small, that calculation computes the output of the entire first layer of our network.
remember, we're still dealing with a single training example here, so our input a0 is a vector. that vector gets fed in, and the output is another vector, because each neuron puts out a scalar and those scalars stack into a vector. so this takes in a vector and puts out a vector — it's a vector-to-vector function, and remembering what jacobians are, you can start to think about the implications of that.
the output of this first layer is a1. now we can feed that into another layer of the network — i'll just erase this so we have some space — so we get w2, our new weights matrix, times a1, plus b2. that's our second layer, and we can just keep going. i'm going to be using this notation quite a bit, just because it's a lot easier than drawing everything out.
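as a hedged sketch of that chain notation, here's what feeding an input through several layers looks like in numpy — the layer sizes and random weights are arbitrary assumptions, just to show the a_l = f(W_l a_{l-1} + b_l) loop:

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [2, 3, 3, 1]                 # input of 2, two hidden layers, one output
    Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [rng.standard_normal(m) for m in sizes[1:]]

    a = np.array([10.0, 20.0])           # a0, the input vector
    for W, b in zip(Ws, bs):
        a = np.maximum(0.0, W @ a + b)   # a_l = relu(W_l a_{l-1} + b_l)
    print(a)                             # the (untrained) network's output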
so i think that's enough on how we compute neural networks. this is the feed-forward aspect — i never really mentioned the term, but this is feeding an input forward through the network for the first time: we have a network, and we push the input forward through it to get an output. at the beginning, before training, this output is going to be really bad, but as we'll see later, through back propagation — which goes backwards — we'll improve the weights and make this feed-forward output better. and at the end of the day, once we've fully trained our network, feeding forward is all we do: with a fully trained network we can feed forward, say, an image, which is a bunch of values x1 through xn, and out comes some sort of classification or answer.
right now we're focusing on the feed-forward pass, but we're going to quickly dive into back propagation, since the feed-forward aspect of neural networks is not too hard to grasp. we covered it pretty quickly here, but if you want more detail there are plenty of resources, and i'll probably put links in the description. so i'll see you next time — i'm not totally sure what we'll do next, but
you'll see. see you soon.
all right, so now that we've covered the feed-forward aspect of a neural network — how we feed our input forward through the network — we're going to focus on back propagation. that's the focus of most of this lecture, just because it's a little harder to grasp, both technically and mathematically. we're going to see how we can minimize the error of our final output by changing the weights and the biases in our network. the way we do this is by going back through the network and seeing how each weight affects the total cost: if we tweak this one weight, will that raise or lower the final cost? using that information we can tweak the weights and biases accordingly and get a better answer. that's basically the entire goal of back propagation.
the first thing i want to cover is the cost function — this is just a quick introduction,
and i'll go much more in depth later, but one thing i want to pin down now is the cost function we're going to use. one of the most popular cost functions — maybe the most popular — is called mean squared error, and you might be very familiar with it. basically, we cycle through every single one of our outputs, compare them to the actual real answers, square the error, and sum over it.
in numerical terms, we take a summation over all m of our training examples (if we put multiple examples through the network and want a total cost), indexed i = 1 up to m. our network's output is most commonly written y hat, so we take the actual answer y, subtract the prediction y hat, and square that difference. so that's what mean squared error is: take the answer we get from feeding forward, take the actual y associated with that training example, find the difference, and square it.
and then the part i almost forgot — the most important part, where the "mean" comes in — is the factor out front: we divide by 2 times m, so the whole cost is

    C = (1/(2m)) * sum from i = 1 to m of (y_i - y_hat_i)^2

the reason for the extra 2 is that when you differentiate the cost function, the power rule brings down a 2, and this 2 cancels with it, which just makes the calculations a little nicer. and it doesn't affect anything, because scaling the whole cost by a constant doesn't change where its minimum is. so this is the cost function we're going to be using: the mean squared error cost function.
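here's a minimal sketch of that cost as code, with the 1/(2m) factor in front — the function name mse_cost is just my own:

    import numpy as np

    def mse_cost(y_hat, y):
        # y_hat: network predictions, y: true answers, both length-m arrays
        m = y.shape[0]
        return np.sum((y - y_hat) ** 2) / (2 * m)

    print(mse_cost(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # 0.0125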
other than that, we're going to start with a very simple scenario that i'll call the simple model, and it's really similar to the one we just saw: five inputs, x1 through x5, going into a single neuron — a vector turned into a scalar output, the activation of that single neuron — and then we feed that into a cost function. so it's a very, very simple neural network with a single node, and this is how we're first going to explore how changing these five weights, plus the bias, changes our cost. we'll use the idea of jacobians that we reviewed at the beginning to calculate the derivatives of the cost function with respect to each of our weights and the bias — if we change this, how does that change? — and we can use that to tweak the weights and lower the error of our final output. we'll do that by calculating derivatives, and then we'll talk about gradient descent: how we can use these derivatives to build an algorithm that automatically finds the lowest cost. hopefully this sounds good — there's a tiny preview of that update step just below.
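as a preview of the gradient descent update we'll build toward — the learning rate and the pretend gradient values here are arbitrary assumptions, just to show the shape of the step:

    import numpy as np

    w = np.array([5.0, 2.0])
    grad = np.array([-40.0, -80.0])   # stand-in for dC/dw from backprop
    lr = 0.01                         # learning rate, arbitrary choice
    w = w - lr * grad                 # nudge each weight against its derivative
    print(w)                          # [5.4 2.8]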
next up, we'll dive into how we start using what we know about jacobians to derive the operations that go on in a single neuron. look out for that.
all right, so now that we've talked about the operations that go on within a single neuron and a layer of neurons, we're going to try to find the derivatives of those different operations, so that we can then go and find the derivatives of an entire neuron with respect to the weights and the biases. how do we start? there are three things i want to talk about here; i'll leave the third for after the first two, because the first two are tied together. we're going to find the jacobians — the derivatives, same thing — of binary element-wise operations and of the hadamard product. the hadamard product is itself a binary element-wise operation, but a good question to ask right now is: what is a binary element-wise operation?
a binary element-wise operation is a function that takes in two vectors v and w and returns a single vector — let's call it b, why not. so what does this mean? there are a few more restrictions, but they're pretty self-explanatory: the operation you do on v and w has to work element by element. so if the operation is an addition, the first element of the output is the first element of v plus the first element of w, the second element of the output is the second plus the second, and so on. it could be an element-wise multiplication, or an element-wise comparison — the output could be something like (0, 1, 0, 0, 1), where a 1 means this element of w is bigger than the corresponding element of v, so basically a true/false boolean output — or it can just be a normal sum. and notice that our hadamard product is a binary element-wise operation: when the operation is multiplication, that's exactly the definition of the hadamard product. so how can we take the derivative
of a binary element-wise operation? (binary, obviously, because there are two vectors involved.) let's put this in our standard function notation: we take some v and w, and in explicit notation write f(v), then some element-wise operator, then g(w). the reason for the f and g is that sometimes we might want to do something to a vector before combining it — maybe multiply every element of v by 5 before comparing it to w, or multiply every element of v by 5 and every element of w by 6, or take the log of every element. and the thing is, these inner functions don't have to be element-wise. the total function — which we'll now write with a capital F — needs to be element-wise, and the operator needs to be element-wise, but f and g don't. for example, f could be something where each element of its output becomes the sum of all the previous elements of v, or something like that — the mini function inside the big function doesn't have to be element-wise. you might ask why we include these mini functions at all; it's just for the sake of generality, because sometimes we want to alter our inputs before doing the element-wise operation. often we don't: with the hadamard product we just directly multiply v and w element-wise, and in that case f and g are identity functions, f(v) = v and g(w) = w. you'll see that a lot of the time with element-wise functions this is the most common case — the inner functions do nothing — but it's good to be general and cover all the different cases.
so how do we take the jacobian of this? the multivariable functions we were looking at in the introduction took just a single vector as input — maybe some (x, y), a 2 by 1 vector — but now we have two input vectors. that means we can get two jacobians out of this: one with respect to the elements of v, and one with respect to the elements of w. we'll just find one of them — let's do v, it really doesn't matter — and at the end i'll tell you how similar the two jacobians end up being: to compute the other one you just substitute a few things and find a few different derivatives. all right, so how do we approach this?
how do we start finding the jacobian matrix for this function? let's do a similar thing to what we did before in our jacobian practice and write the output as a vector. the output of this is going to be a vector, because we're taking two vectors, doing something to them, and outputting a vector. so what does that output look like, displayed as a vector of functions? it's going to be f1(v), then the element-wise operation, then g1(w); then f2(v), the element-wise operation, g2(w); and you get the point — this goes down to fn(v) and gn(w).
the reason we don't index the variables v and w is, as i said, that these functions f and g don't necessarily have to be element-wise — f2 might use all the elements of v. adding an element index there would limit us to element-wise computation on a single value, and we don't want that; we want to let each function compute anything with all the values in the vector. but we do index the functions themselves, just to organize them better and to keep track of the derivatives with respect to each of them.
okay, awesome. so what we're going to end up with is a jacobian, the change in F with respect to v. remember how the rows of a jacobian work: let's index these functions with a capital F to keep track of them. each row of our jacobian is one of the functions, and each column is how that function changes when we change a particular element of v.
so take the first entry: how does F1 change when we change v1, the first element of v? let's explicitly write out what F1 is — it's f1(v) combined element-wise with g1(w) — which makes it a little clearer what we're doing: we're measuring the change in f1(v) combined with g1(w) as we change v1.
let's keep writing out the whole jacobian and see if we can find any patterns. going along the columns, next is the change with respect to v2, still for F1, and so on — making it general, out to the nth element — the derivative of f1(v) combined with g1(w) with respect to vn.
then for the second row it's v1 again, but this time with F2: how does changing the first element of v change this second function, f2(v) combined with g2(w)? and so on across — i'm just completing this for the sake of completeness.
and for the last row we're dealing with the same variables, but now with the nth function: how does each element of v affect fn(v) combined with gn(w)? this isn't very neat, but hopefully you get the idea. so basically,
for each row in this jacobian we're computing how each element — v1, v2, whatever it is — changes that row's function. and if we look at this, we can notice some patterns.
take this first entry. this partial derivative measures how the first function changes when we change v1. in our case — when we're dealing with things like hadamard products, where g and f don't manipulate anything — we can ignore g and f, because they're identity functions; so let's drop f1 and g1 and not worry about them. what's left is fine, and it's non-zero, because this is an element-wise operation: in this first row we're dealing with the first element of v and the first element of w, so changing v1 really does change the output of the first function — we're doing computation with the first element of v and the first element of w.
but now look at the entry below it. this measures the change in the second function when we change v1 — still the first element of the vector. the second function, by the same reasoning, deals only with the second element of v and the second element of w, but we're taking the derivative with respect to the first element. so there's going to be no effect: v1 plays no part in the element-wise operation between v2 and w2, so this entry goes to zero. the same happens below that, because v1 has no effect on the third function, which deals with v3 and w3 — that becomes zero as well. and looking across instead of down: v2, the second element of v, has no effect on F1, because F1 deals with v1 and w1 and not v2 in any way, so that entry equals zero too.
so the pattern we're going to see is that anything off the diagonal turns into a 0 — we just get a diagonal matrix of non-zero values, and everything off the diagonal is 0.
the thing i really want to make clear — extrapolating from this to all sorts of cases — is that for all element-wise functions, the jacobian will be diagonal, because each function only ever deals with the nth component of v and the nth component of w. if we take the derivative with respect to anything that isn't that vn or wn component, there's no effect on that function. so for any element-wise function we get a diagonal jacobian matrix. let me write that down, because it's important enough to write down: the jacobians of all element-wise functions are diagonal.
and that's a big deal if you're thinking computationally: it's a lot cheaper than computing an entire jacobian matrix, because you're only dealing with the diagonal. a common way to write this is diag — short for diagonal, obviously — followed by your diagonal non-zero values. so if you had a diagonal matrix with a, b, c down the diagonal, you'd write diag(a, b, c), which says: a in this place, b in this place, c in this place, and everything else is zero. it's almost like spreading those values along an identity matrix.
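here's a quick numerical check of that claim — my own illustration, with f(v) = v squared and g(w) = log(w) applied element-wise and then added, so F(v, w) = v^2 + log(w):

    import numpy as np

    def F(v, w):
        return v**2 + np.log(w)      # a binary element-wise operation

    v = np.array([1.0, 2.0, 3.0])
    w = np.array([4.0, 5.0, 6.0])

    eps = 1e-6
    J = np.zeros((3, 3))
    for j in range(3):               # column j: nudge v_j, watch every output
        dv = np.zeros(3); dv[j] = eps
        J[:, j] = (F(v + dv, w) - F(v, w)) / eps

    print(np.round(J, 3))            # diag(2, 4, 6) = diag(2v); off-diagonals are 0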
and once we move on to the element-wise product itself, this will make things a lot easier to understand, because the hadamard product is, as we said, a binary element-wise function — so that'll make deriving it a lot easier. i'm going to get some water, and then we'll talk about the hadamard product and deriving it, as well as how to derive a sum, because that's basically all the operations in a neuron — and then we can get to actually deriving a neuron's bias and weights. anything else to mention? i don't think so, so let's get to the next thing.
all right, let's quickly talk about the derivative of the hadamard product.
this is actually going to be a pretty quick discussion, since we already covered the derivatives of binary element-wise functions, and the hadamard product is a great example of one. i already mentioned what it does: it element-wise multiplies two vectors. so how do we take the derivative of this with respect to the elements of either v or w?
let's quickly write this up. our function capital F takes some vector v and some vector w and returns some vector. keeping the general notation for a moment — even though in a hadamard product you really don't need the mini functions — we'd write f1(v) combined with g1(w), then f2(v) combined with g2(w), and so on down to fn(v) combined with gn(w).
the reason you don't need the mini functions in a hadamard product is that you just multiply v and w — you don't do anything to the vectors before multiplying them. especially in our case, where we're thinking of the hadamard product in the context of substituting for a dot product — breaking a dot product down into a hadamard product plus a sum — we aren't transforming anything, because we're trying to mimic the dot product exactly. so the little functions can go.
one subtlety, though: before, when we did have these little functions, we never really wrote it down, but we would technically be indexing the output of the little function — because even though f itself doesn't have to be element-wise, the overall combination still is. so the nth row of the output vector is f(v) sub n, the nth component of f's output, combined with g(w) sub n, the nth component of g's output. but now that the mini functions are gone, we're just dealing with the vn and wn components of the input vectors directly, so we can put the index right on them: the nth row is simply vn times wn.
okay, so now let's look at the jacobian of this with respect to v — let's stick with v. i won't write out the whole thing, but the first row is the derivatives of F1 with respect to v1 through vn, the second row is F2 with respect to v1 through vn, and so on down to Fn. and because this is specifically the hadamard product, each function is now just a product: the nth function is vn times wn. so how does the first function change with respect to v1? it changes by a factor of w1, because it's just a multiplication of the first element of v with the first element of w — and when you differentiate a product like x times w with respect to x, you get w. so the first diagonal entry is w1, and the off-diagonal entries are zero — i don't think you need me to fact-check that anymore; anything off the diagonal of an element-wise function is zero. so we end up with diag(w1, w2, ..., wn). and it's the same thing the other way around: if you took the jacobian with respect to w, the entries would be v's instead — same structure, which is pretty easy to convince yourself of. so that's the derivative of a hadamard product.
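in code that jacobian is just diag of the other vector — a small sketch:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])
    w = np.array([4.0, 5.0, 6.0])

    J_v = np.diag(w)    # d(v elementwise-times w)/dv: w down the diagonal
    J_w = np.diag(v)    # d/dw: same story with the roles swapped
    print(J_v)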
that means we're about halfway to deriving a neuron — we just need to cover the sum, plus one more thing called scalar expansion, which i'll go into right after this. all right, see you there.
now that we've covered the derivative of binary element-wise operations, we can deal with two slightly easier ones. one that i forgot to write down before, but came to me, is the derivative of a scalar expansion, which basically means the derivative of multiplying a scalar by a vector. again, all of this initial material is inspired by jeremy howard and terence parr's paper on matrix calculus — that's mostly just for this groundwork, and i'll be adding more with regards to back propagation later. all right, so let's deal with scalar expansion first. scalar expansion is when you multiply,
say, a scalar 2 by a vector v — let's write down the elements so the result looks familiar — and get 2v: every element multiplied by 2. or we can add the scalar to every element, or divide, and so on. the way to think about what's actually going on — and what happens in any computer program — is that we broadcast this scalar. it's called scalar expansion because we expand the scalar into a vector of the same size and then do an element-wise operation. that's exactly the same thing as multiplying by the scalar, but under the hood we never technically multiply a vector by a scalar — the scalar is broadcast into a vector of the same size first.
so what's the derivative of something like this, the scalar expansion plus the operation? two questions, really: what's the derivative with respect to the elements of v, and — more interesting, and newer — what's the derivative with respect to the scalar x that we multiply or add to each element?
we can set this up in similar terms to before. remember how we had F(v, w) taking two vectors; now we take one vector v and one scalar x. keeping the capital F, we can write our function as f(v), then some element-wise operation, then g(x), where the function g(x) equals the ones vector times the scalar x. if you're not sure what that means: g(x) expands x into a vector by multiplying it by the ones vector — if x is 2 and the ones vector is (1, 1, 1), this just becomes (2, 2, 2).
so that's the mini function here — and this is a case where the mini function inside the bigger function is actually helpful. now we can do the exact same thing as before and represent this as a vector of functions: f1(v) combined with g1(x), then f2(v) combined with g2(x), down to fn(v) combined with gn(x). the output, whatever it may be, is going to be a single vector.
first, let's picture the jacobian with respect to the elements of v. it looks much like before — i won't write out the full thing this time: the first row is f1 with respect to v1, f1 with respect to v2, out to f1 with respect to vn; the second row is f2 with respect to v1, f2 with respect to v2, and so on; down to the last row, fn with respect to v1 through vn. so that's our jacobian matrix — but what are the entries actually going to look like?
so how does the first element of v change the output of f1? it's much the same as before — some non-zero value that depends on the operation. say the operation is multiplication, and f1 doesn't do anything (it's the identity): then the entry is x. note that now that f does nothing, we can confirm we're really dealing with just v1 — the only reason we didn't index the variables before was that f might use multiple elements of v, but with no mini function in the way, the first output is just v1 combined with the broadcast vector. and the g here still does something: it produces a vector where every single element is the same, namely x, so the first output is just v1 times x — and the derivative of v1 times x with respect to v1 is x. and we see the same thing as before: a diagonal jacobian matrix, because obviously v1 has no effect on the outputs involving v2, so all the off-diagonal entries become zeros. so with respect to v, for a multiplication, we just get diag(x, ..., x) — whatever the scalar is, down the diagonal.
and that's it for the derivatives with respect to v. now for something a little more interesting, or at least newer: the derivative with respect to x. x is unique in that it's a scalar — it doesn't have any indexes, it's just a single number. so all we can do is see how this one scalar changes each of our functions f1, f2, and so on. you can't go horizontally, because moving horizontally along the jacobian means indexing into a vector — some v1, v2, and so on — and x has no indexes. all you can do is go down and see how x changes the different functions.
so when we take the derivative with respect to x, what we get is something like a gradient rather than a jacobian: a single column of derivatives, one per function — how f1 changes with respect to x, how f2 changes with respect to x, all the way down to how fn changes with respect to x. and if we continue with our multiplication example, that's just going to be v1, v2, ..., vn, by simple scalar calculus rules: each output is some vi times x, and the derivative of that with respect to x is vi. so the vector of derivatives with respect to x is actually equivalent to v itself.
but it won't always be v. if the operation is an element-wise addition, it'll be a vector of ones, because when you differentiate something like vi + x with respect to x you just get 1. and if it's an element-wise subtraction, it'd be negative ones. i'm trying to think of any other cases that are interesting — not really — but you get the idea.
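a small sketch of both results for the multiplication case F(v, x) = x times v — diag(x, ..., x) with respect to v, and the vector v with respect to x:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])
    x = 5.0

    J_v = np.diag(np.full(3, x))   # dF/dv = diag(x, x, x)
    dF_dx = v.copy()               # dF/dx = v, one entry per function
    # for element-wise addition, dF/dx would be np.ones(3) instead
    print(J_v, dF_dx)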
so that's scalar expansion. next up, we'll talk about the sum.
all right, now we're done with the hard part — differentiating a hadamard product, or a binary element-wise function in general. now we have something easy: differentiating a sum. what's the function we're trying to differentiate? it's something that takes a vector v and sums over its elements, all the way up to vn. so our function is a summation from i = 1 up to n of g(v), indexed at i. the reason we use a g again is much like with our element-wise products: we might want to do something to the vector first that isn't necessarily element-wise, which is why we index the function's output rather than v itself. an example of a g would be multiplying each element by two before we add them up: g produces a vector where each entry is doubled, and then we index into the ith value of that new vector. the reason we don't put an index on v directly is, again, that g might not be element-wise — whatever g is, we're indexing into the new vector it produces. anyway,
so what is the derivative of the summation function with respect to each of the elements — how does changing each element of the vector change the overall result? that's what we're asking here.
with the jacobians before, we had multiple functions, one per row. but a summation is just one function, so we'll have a single row. and when you think about it, a summation is a vector-to-scalar function — we take a vector, sum over it, and get a scalar — so it makes sense that what we get is a gradient, not a jacobian. so we have a single function, the summation s, and each entry of the row is the derivative of s with respect to one variable of the vector: the derivative of s with respect to v1, then with respect to v2, all the way to vn. expanding the s to make it concrete, each entry is the derivative of the whole sum over i of g(v) sub i with respect to that element (it's the same summation in each entry — i just don't want to write it out every time).
from here we can use the rule that the derivative of a sum is the same as the sum of the derivatives, so we can swap the order: pull the summation outside and put the derivatives inside. and soon enough this reveals something interesting.
let's first consider the case where g is the identity, so g(v) = v — just a pure summation, no g's. then we can index v directly, so the first entry is the sum over i of the derivative of vi with respect to v1. going through the summation: the term is zero for every i except i = 1, where the derivative of v1 with respect to v1 is 1. so the result is 1 plus (n minus 1) zeros — just 1. same for the next entry: when i = 2 the term is the derivative of v2 with respect to v2, which is 1, and everything else is zero, so that's 1 again. so i might be seeing a pattern here —
the entries of this gradient are all just ones. and if we have a g that, say, multiplies each element by some scalar z, then each entry is z times 1 plus a bunch of z times 0 terms, which is just z. so the derivative of a summation is: if there's no g, the ones vector transposed, 1 transpose; with a g that scales by z, it's z times the transposed ones vector; or whatever corresponds to the element-wise operation g does to each element.
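a quick finite-difference check that the gradient of a plain sum really is the ones vector — my own illustration:

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])
    eps = 1e-6
    grad = np.array([(np.sum(v + eps * np.eye(3)[i]) - np.sum(v)) / eps
                     for i in range(3)])
    print(np.round(grad, 3))   # [1. 1. 1.] — the transposed ones vector
    # with a g that scales each element by z, this becomes z * ones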
so that's the derivative of the sum — pretty easy stuff. all right, now that we've covered the sum, what we're going to do next is actually derive a neuron with respect to the bias and the weights. that's going to be exciting — see you there.
okay, so now this is exciting: now we're
dealing with neurons, and we're trying to take the derivatives of a neuron with respect to the weights and the bias. let's take our running example. what we're going to do now is not worry about the cost function just yet — we're just going to find how the activation of this one node changes when we change the weights and the one bias of the system. so there are six things we can tweak, and we're trying to find, basically, the gradient with respect to all the weights and the bias.
so let's write out the notation for how the single neuron processes things: it's going to be f(w transpose x + b), and that's what our activation a is equal to. this is something you'd use the chain rule on, because you have this outer function f and this inner function w transpose x + b. if you're trying to take derivatives with respect to w or b, you can't go straight from a to w — you need to deal with this intermediate activation function. so we can do some chain rule detective work: the change in a with respect to w (and we'll want b as well) can't be taken directly, because there's something in between. calling the inside z, we can write it as the change in a when we change z — which is basically the relu thing, or whichever activation function we're using — times the change in z when we change the weight (and similarly for the bias).
perfect, so that's what we're looking for. how can we use what we've learned to differentiate this? we have all the tools we need. remember how we separated a dot product into a hadamard product and a sum? we can rewrite z as this new funky thing: the sum of (w hadamard x), plus b — exactly equivalent. now we have a couple more links in our chain rule, because we have more functions: let's call the inner hadamard product h, and the sum of h we'll call s. so now it's: how does z change when we change the sum s, how does the sum change when we change h, and how does h change when we change w. and the bias is different — there we can still go straight, because the bias is completely independent of all this: when we take the derivative of z with respect to b, the whole sum part contributes nothing, since there's no b anywhere in it.
all right, so these are the two chain rule expressions we need to figure out, and we already know how to do most of this. in general, how does the hadamard product change when we change one of its two multipliers? the jacobian is just a diagonal matrix of the elements of the other vector — diag(x1, x2, ..., xn), with zeros everywhere off the diagonal — and obviously i won't explain why, because i already did. so that's what this is: we've got the derivative of h with respect to w covered, so let's label it
accordingly. we can write it as diag(x1, ..., xn), and we can treat it almost like a vector — horizontal or vertical, just a vector — because all the information is stored along the diagonal.
so now, how does the sum change when we change h? this is why i said you should probably think of that diagonal as a vector, because now we're summing over all the elements in it. and since the sum isn't doing anything fancy — we're not multiplying anything, so the g in there is just the identity — the derivative is 1 transpose, the horizontal ones vector, which is the derivative of a sum when there's no g. so the derivative of s with respect to h is just the horizontal ones vector; let's take the time to put that in.
okay, now how does z change when we change s? that's not too bad, because z is just s(h) plus b, and we're differentiating with respect to s(h) itself — so it's 1: the b zeroes out, and the s(h) term gives 1. i'm just confirming that.
so now let's multiply out what we have so far: a horizontal ones vector times a diagonal matrix of the values of x, and that collapses to a horizontal x vector — i'll just write it as x transpose, and those intermediate pieces can go away.
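you can check that little multiplication directly — the ones row vector times diag(x) collapses to the row vector x transpose:

    import numpy as np

    x = np.array([10.0, 20.0, 30.0])
    ones_T = np.ones((1, 3))       # the horizontal ones vector
    print(ones_T @ np.diag(x))     # [[10. 20. 30.]] — that's x transpose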
so let's move all that aside — now here's something interesting: we're dealing with the activation function. this could obviously be a lot of different activation functions — tanh, relu, sigmoid, or any one of the many versions of sigmoid that exist now — but let's stick with relu, since it's one of the more popular ones, and also because the paper i've mentioned a couple of times, by terence parr and jeremy howard, uses relu, so you might be able to follow along a little better if you're using that paper as a secondary reference.
the funny thing with relu is that, as opposed to sigmoid — which is a concrete formula (i don't remember the exact sigmoid expression offhand, but it's an exact function) — relu is a bit weird: it's the max of 0 and z, where z is our inner expression. looking at the graph, with z on one axis and the output of relu on the other: any z that's less than zero is just clipped to zero — it doesn't stay negative — and anything non-negative just stays itself, a line of slope one from zero upward.
one more thing to mention: relu is actually not differentiable everywhere — there's a sharp corner right at z = 0, and as you might know from calculus one, the derivative at a sharp turning point is undefined. so it's genuinely undefined at that exact point. but this is solved really easily in practice: first of all, hitting an exact zero — 0.0000... — is extremely rare, and if it did happen, implementations just pick a convention for the derivative at zero (or nudge the value by a tiny epsilon, like 0.0001), so it really doesn't matter in practice.
so how do we take the derivative of a max? it's actually not that hard: we write it as a piecewise function, with two scenarios — if z is less than or equal to zero, the derivative is 0, and if z is greater than zero, the derivative is 1.
so what does that do? if z is less than or equal to zero, it zeroes out everything — our entire derivative becomes zero, because in our chain of derivatives this factor is 0, and it kills everything else. (that x transpose we found — remember there were three derivatives here that we multiplied and combined — represents how z changes with respect to w; so in this case it's 0 times that.) but if z is greater than zero, the factor is 1 and everything stays the same: 1 times x transpose.
so now we can do some substitution. this will just equal 0, or it will equal x transpose. and we can substitute in what z actually is — i won't write the hadamard version because it's a little messy — so: the derivative of a with respect to w is 0 if w transpose x + b is less than or equal to zero, and x transpose if w transpose x + b is greater than zero. and that's it — that's how the activation is changed by w. so now that we have that, we just have to do the bias, which is a lot easier.
i'll do the bias right above, just to complete this bit. what we have so far tells you, for any specific w, how much the activation changes — it'll end up being a gradient, a bit weird because of the piecewise function, but it accurately tells you what happens with regard to any w. so let's deal with the bias now.
the bias involves just two terms: how z changes with b, times the activation derivative. how z changes with respect to b is just 1, because the w transpose x part differentiates to zero — so it's 0 plus 1. the first part is 1, easy so far. and then we use the same piecewise trick, because now we're going from a to z (i don't know why i keep writing max here — it's really just the piecewise derivative): the derivative of a with respect to b is 0 times 1, so 0, if w transpose x + b is less than or equal to zero, and 1 times 1, so 1, if w transpose x + b is larger than zero. a lot simpler.
so, i made one mistake that i want to take note of — it's not too big, but it's important. back here, remember we weren't all the way done: we had 1 times the change in z when we change w in one case, and 0 times the change in z when we change w in the other. the change in z with respect to w is x transpose, and 1 times x transpose is just x transpose — we got that. but the thing is, the 0 case also multiplies x transpose, so it actually becomes a whole vector of zeros, transposed — not a scalar zero. i just want to make that clear: it's the entire vector x in one case, and an entire vector of 0's in the other.
that makes more sense when you look at the overall gradients of these functions: if w transpose x + b is zero or less for a given activation, then the derivative with respect to every weight is zero — everything gets zeroed out by relu. but if w transpose x + b is larger than zero, then the derivative with respect to each weight is its corresponding x, funnily enough. so for example, if you want to find how the activation changes when we change, say, weight four, it changes by x4 — that's the multiplier, so to speak: x4 is the derivative of the activation with respect to the fourth weight. and if it's all zeros, because it's been zeroed out by relu, then the fourth weight has a derivative of zero. so i just wanted to mention that, and also apologize for the slightly wrong vector notation earlier — but i hope you got the point.
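putting the piecewise result into code — a sketch of the gradient of a single relu neuron a = max(0, w transpose x + b); the function name neuron_grads is my own:

    import numpy as np

    def neuron_grads(w, x, b):
        z = w @ x + b
        active = 1.0 if z > 0 else 0.0
        da_dw = active * x        # x transpose if z > 0, the zero vector otherwise
        da_db = active * 1.0      # 1 if z > 0, 0 otherwise
        return da_dw, da_db

    w = np.array([5.0, 2.0]); x = np.array([10.0, 20.0]); b = 6.0
    print(neuron_grads(w, x, b))  # (array([10., 20.]), 1.0) — the neuron is active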
all right, next i'm going to move on to adding on the cost function and seeing how we can take derivatives with respect to the cost, and then exploring how we can use those derivatives to do gradient descent and find the optimal solutions. all right, see you there.
all right, so now we're moving on
from just the derivatives of the activation with respect to the weights and the biases to the derivatives of the cost with respect to the weights and the biases. continuing with our example, we now have just one more derivative in our chain of derivatives: we compute the activation, and then we compare that activation against the real answer to get the total cost, using the mean squared error function, which i'll keep over here so we don't forget it — 1 over 2m, times the sum over all our training examples from i = 1 up to m (i'll explain how we represent multiple training examples shortly). i think i said before that we take our prediction and subtract the actual answer; for ease of derivation i'm going to switch it around and take the actual answer minus the predicted activation — the order doesn't really matter, because we're squaring it. so: actual minus prediction, squared. that's the mse, the cost function we're going to use — keep it in mind. so anyway:
Now, in the last part we calculated how a changes when you change the weights and the bias, so we're more than halfway there. If you think about it, we have this chain of derivatives that we compute with the chain rule, and we just have to add one more link to that chain: the link between a and C. Our goal is to find how C changes with w, as well as how C changes with b. We already calculated how a changes when we change w and how a changes when we change b, and those were piecewise: we get the zero transpose vector if w transpose x plus b is less than or equal to zero, and we get exactly x transpose, the derivative of z with respect to w, if w transpose x plus b is greater than zero. Similarly for the bias we have a piecewise thing: the scalar zero if w transpose x plus b is less than or equal to zero, and one if it's greater than zero. So those two piecewise expressions are already in hand, and that was the hardest part, because to figure out how a relates to w we had to go through a whole series of derivatives: first how a changes with z, then how z changes with that weighted summation, then how the summation changes with each weight. You can think of it literally as a chain. We've figured out most of that chain, because we know directly how our activation a changes with w and b; now we just need one more link, connecting the activation to the cost.
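In symbols, the chain we're completing looks like this, where the factor dC/da is the one new piece we still need:

$$\frac{\partial C}{\partial \mathbf{w}} = \frac{\partial C}{\partial a}\,\frac{\partial a}{\partial \mathbf{w}}, \qquad \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\,\frac{\partial a}{\partial b}$$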
So let's make that last link. We say the cost changes with a, and if you line up the terms, the matching factors cancel across the chain, which works out. The only thing we don't know, the thing we need to find, is how the cost changes when we change the activation. Looking at our mean squared error, that means taking the derivative of the MSE with respect to the activation. It might look tricky; it's not too bad, just a bit tedious. Similarly for the bias, we figure out how C changes with a in exactly the same way and then just use the other piecewise function, and we get the same kind of result. So this one derivative, how the cost changes with respect to the activation, is all we have to figure out. Let's see if we can do it; I might have to look at my notes, because it's a fairly long process. Again, this pairs with the paper I've been referencing, by Parr and Howard. Let me see if I have space to keep it up on the board; I'll probably run out and have to erase, but let's write out our cost function. I believe Parr and Howard's paper uses m rather than 2m; I'm going to stick with 2m, which is what I'm more comfortable with, and it works out nicely because you can factor out the 2, as you'll see. So we sum from i equals 1 up to m.
Oh, one last thing I want to mention before going into this: how we express our training examples now. As part of our cost function we're summing over training examples; m is how many training examples we have. For each training example we take the error, and then we average for a total error across the training set. Before, we had a single training example as a vector running from x1 down to xn. Now we extend that into a matrix where each column is a new training example. I'll signify the training example with a superscript and the variable number with a subscript, though I won't really use individual entries much; I'll mostly refer to the whole matrix. This matrix, with the training examples as its columns and the variables as its rows, I'll signify with a capital X, and it doesn't need any superscripts or subscripts, unlike a single input.
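As a quick illustration of that layout, with hypothetical toy numbers just to pin down the shapes:

```python
import numpy as np

# n = 5 input variables (rows), m = 3 training examples (columns),
# matching the convention here: each column of X is one example.
X = np.array([[0.2, 1.0, -0.3],
              [0.5, 0.1,  0.7],
              [1.5, 0.9,  0.0],
              [0.0, 0.4,  1.2],
              [0.8, 0.6,  0.5]])
y = np.array([1.0, 0.0, 2.0])   # one right answer per column of X
```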
All right, with that out of the way: we have our actual answers y, and each column of X is associated with a right answer, because this is supervised learning. From that we subtract our activation. I wasn't sure what to call the final activation; in our simple network it's the only activation, but in a larger network with multiple layers, which we'll explore right after this, it would be a^L. We might as well get used to that notation, so let's say a^L is the output of our neural network, the thing we compare against the true answer, and then we square that difference. So let's take some derivatives. Is there anything we can simplify beforehand? We already know all the derivatives inside the network, so we
can take one more thing out of this to simplify it a little. Remember, with the chain rule we're always looking for intermediate functions that simplify things, and here's a good candidate: let's say that y minus a^L equals v. That's not tough to differentiate. Keep in mind that ultimately we want the derivative of all of this with respect to the weight; that's the nature of the chain rule, we can't stop halfway. We already have a with respect to w, so we're working outwards, getting closer to the cost, and we need something in between the cost and a. By assigning this little v we're adding another link to our chain: instead of just how a changes with w, we now track how v changes when we change a, and then how C changes when we change v. You can see how we've added a link by introducing this new variable. The payoff is that the cost becomes just v squared, which is a lot nicer to look at. We pay for it by having to find one more derivative, but that one isn't hard. So let me rewrite this in a nicer way: we want the derivative of C with respect to w, which is now how C changes with v, times how v changes with a, times how a changes with w. We already have that last piece, so we can collapse the last two factors into how v changes with w, which means we need to find the derivative of v with respect to w.
So we take the derivative of v equals y minus a^L with respect to w. The y is completely unrelated to w, so that becomes a zero vector, and by the chain rule the minus a^L term just gives minus the derivative of a^L with respect to w. Pretty simple: how v changes with respect to w is how a changes with respect to w, just with a minus sign. That's important to note, so let's put it down right here: dv/dw equals minus da/dw.
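Noting that down in symbols:

$$\frac{\partial v}{\partial \mathbf{w}} = -\,\frac{\partial a^{L}}{\partial \mathbf{w}}$$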
That's good to keep in mind, and we'll explore what it means to take the minus of a piecewise function later on; it's really not as hard as you might think. So now we have a slightly simplified expression, and all we're trying to find is how C changes with respect to v. Let's go to the next step and start computing. We're still ultimately finding the derivative with respect to w; it's just written in a nicer form now. We know the derivative of a sum is the sum of the derivatives, and the 1 over 2m out front is just a number, so we can pull it out: 1 over 2m, sum from i equals 1 to m, and we move the derivative inside the sum.
Now taking the derivative inside is pretty easy. We get one more link in the chain, but it's an insignificant one, because it's just the power rule used once. We have v as a function of w, so to find the derivative of v squared with respect to w we use the chain rule again: v squared is an outer function wrapped around v, so the derivative is the derivative of v squared with respect to v, which is just 2v, times the derivative of v with respect to w, which we already have. The 2v part is easy; I'll keep dv/dw unexpanded for now, because expanding it out would be a pain, and let's see what we can do first. Well, one thing we can do is take the 2 out of that 2v (I almost forgot the 1 over 2m out front), and then the 2 cancels against the 2m.
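So at this point the expression on the board reads (my reconstruction):

$$\frac{\partial C}{\partial \mathbf{w}} = \frac{1}{2m}\sum_{i=1}^{m} 2\,v_i\,\frac{\partial v_i}{\partial \mathbf{w}} = \frac{1}{m}\sum_{i=1}^{m} v_i\,\frac{\partial v_i}{\partial \mathbf{w}}$$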
Nice, this is looking a lot better. Now let's finally bite the bullet and replace that dv/dw quantity, and also tackle the question of what the negative of a piecewise function looks like. Let's do the weights first, since those are arguably a little more important. So we have 1 over m, sum from i equals 1 to m, written nice and pretty, and we multiply by v. When you take the negative of a piecewise function, all you do is negate each option. That makes sense when you think about it: a function is only piecewise to us, as a kind of if-or to read; for the actual numbers, once the condition picks a branch, it's simply that branch, no thinking required. So we negate the options: the negative of the zero transpose vector is still the zero transpose vector, and the other branch becomes minus dz/dw. I don't strictly need to keep writing out the conditions, and you're probably tired of them, but it's good to be thorough. Using that same intuition, that a piecewise function isn't really piecewise to the numbers themselves, we can multiply the v in: v times zero is zero (I'll drop the details of that branch, since it's all zero anyway), and the other branch becomes minus
v times that derivative. Now it's really just a game of substitution. We rewrite everything, the structure stays the same, and we substitute the derivative for what it actually is, which is x transpose. We also have this v, which we assigned to y minus a^L, so we'll have to expand that out, and we'll have to expand the a^L too, which is the w transpose x plus b thing; you can see where this is going. So, expanding: we've multiplied the v in, and substituting x transpose, the non-zero branch becomes minus v times x transpose (you know the conditions by now). Then we expand the v: the zero branch stays zero, and the other branch becomes minus (y minus a^L) times x transpose. If you're okay with it, let me erase the mean squared error from the board, since I think we've got that down by now. Now, this activation: we're using the ReLU, right, so a^L really is max of 0 and w transpose x plus b, because w transpose x plus b is z. Let's just try replacing it and see what happens: a^L becomes max(0, w transpose x plus b), so the branch reads minus (y minus max(0, w transpose x plus b)) times x transpose. Careful with brackets here; we were missing a few. The minus only applies to the (y minus max(...)) part, so the x transpose by itself isn't directly affected by the minus. And all of this holds if w transpose x
plus b is greater than zero. These conditions come in handy now; you might have thought I was wasting time on them, but look: taking the max of zero and w transpose x plus b, the case where w transpose x plus b is less than or equal to zero would have been handled by the piecewise function anyway. There's no reason to keep the max, because if we're on this branch, w transpose x plus b is already greater than zero; the max is redundant, since the piecewise condition does the exact same thing. So we can get rid of the max and the zero and just read this branch as w transpose x plus b, given that it's already positive. Nice, we've simplified a lot; this is much prettier. Now for the fun little task of carrying the minus sign through. I don't trust myself on this one, so carefully: minus (y minus (w transpose x plus b)) becomes w transpose x plus b minus y, with everything inside turning positive. Rearranged nicely, the branch is (w transpose x plus b minus y) times x transpose, and that's essentially our final answer there. One last thing we can do is bring the summation inside the piecewise function; the 1 over m stays outside. Summing the zero branch still just gives zero, the actual zero vector, so nothing changes there, and on the other branch we now have the sum inside. We're allowed to do this by the exact same idea as before: the numbers don't care about the piecewise function, it just selects one branch, so it makes no difference whether the sum sits inside or outside. And with that, we're done: that's the derivation of the derivative of the cost with respect to the
weights. I'll explain shortly what this actually means, because it might seem a little abstract: what form is this even in, and what does it mean? I'm about to explain that.
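For reference, here's the result we just derived, with the per-example condition made explicit (the board version glosses over the example index, and the 1/m can sit inside or outside without changing anything):

$$\frac{\partial C}{\partial \mathbf{w}} = \frac{1}{m}\sum_{i=1}^{m}\begin{cases} \mathbf{0}^{T} & \text{if } \mathbf{w}^{T}\mathbf{x}^{(i)} + b \le 0 \\ \left(\mathbf{w}^{T}\mathbf{x}^{(i)} + b - y^{(i)}\right)\mathbf{x}^{(i)T} & \text{if } \mathbf{w}^{T}\mathbf{x}^{(i)} + b > 0 \end{cases}$$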
So at the end of the day, once we're done deriving the change in C when we change some weight, what we want out of this is a jacobian or gradient that we can feed to gradient descent to help find the minima of the cost. The kind of thing we want is a vector where each entry is how the cost changes with respect to some weight: weight one, weight two, and so on, generally up to wn. In our example with five weights, this vector should be five entries long. So we want the derivation to output that, and now that we've taken the derivative, we want to double-check that that's what we're getting. To understand it a little better, look at this expression: it's an error term. W transpose x plus b is our prediction, the outcome a^L (remember we got rid of the max because we found it redundant), so a^L minus the corresponding real answer for that example is the error. Let's write the error for example i as e_i. Now, the case where w transpose x plus b is less than or equal to zero is easy: our resultant vector is just the zero vector. I know there's a bit of confusion in me writing zero transpose; just disregard that and treat it as transposed back into a column of zeros. In that case our gradients with respect to the weights are zero because everything is zero: the activation is zero, the weights have no effect, it's all zeroed out, so we don't need to worry about it. Let's worry about the non-zero case instead; that's the interesting
part. Substituting the error term in, this branch is the error e_i times x transpose, which is a lot easier to look at; and don't forget the sum and the 1 over m. You can think of the error term as how incorrect the answer is. Now, looking at the x transpose and the sum, say we only have one example, so m equals 1. Then we can drop the sum, and we can drop the 1 over m because it's one over one, which is just one. So in the single-example case we have our error times x transpose, which is just literally the inputs. In our five-input scenario, here's what it looks like. I like to transpose this back to a column, just because the gradient we feed into gradient descent is usually shown vertical, so let's cheat and make it vertical: x1, x2, x3, x4, x5. Then we multiply by our error e. The error is the same number for every weight within a given training example (the index i is over training examples, and we only have one), so we get: e x1, e x2, e x3, e x4, e x5.
This resultant vector is how the cost changes with respect to each of our weights. The reason we get a vector is that the whole cost function is one big vector-to-scalar problem. Remember how we talked at the beginning about getting gradients for multivariable functions that take vectors to scalars? That's exactly what happened here: we put in our five x's as inputs, ran them through this neural network, and it spat out a singular cost, the mean squared error, the mean of all the errors. So here is how C changes with w1, here is how C changes with w2, then w3, w4, and w5 (this is kind of a big moment for all of us). When you change w1 by a small amount, the corresponding change in the cost is exactly this first entry. So we've achieved our goal of
knowing what happens when we change a weight: how does our cost function change? This is the answer, and the reason gradient descent works, which I'll get to shortly. But first let's see what happens when you have multiple training examples. With multiple training examples we extend this; it's still a vector, but now we're taking sums. This took me a while to understand when I first saw it, so let's see how well I can explain it. Go back to the expression we had, and notice that this x transpose, our x1 through x5, is literally the product of all those chain rule derivatives we took; the relationship between the cost and the weights runs through these x's. That's interesting to note, but to find the real gradient we have to multiply the error into each one, and we'll see why that makes sense soon. Now say we have m training examples, say three of them. Each training example has a different error, a different number each time, depending on its right answer and its inputs; strictly speaking each example also brings its own input vector, so to be careful I'll index the inputs by example too. We want one vector at the end, and because of the summation, each row, each derivative with respect to one weight, accumulates over the examples. For the first training example we do exactly what we did before: multiply its error by each input, e1 times x1, e1 times x2, e1 times x3, e1 times x4, e1 times x5, which is just the same as saying e1 times x transpose. Then we do it again: we add e2 times the second example's inputs, because now we're on the second iteration with a new error, and then again for the third, e3 times its inputs, all the way down each row; you get the point, I won't fill it out totally. So we can represent this in one long vector: the first row is e1 x1(1) plus e2 x1(2) plus e3 x1(3), the second row is e1 x2(1) plus e2 x2(2) plus e3 x2(3), and so on, where the superscript says which training example the input came from. One last thing: don't forget the 1 over m term. For each of these rows we take the average, so the numbers don't blow up all over the place: with three training examples we divide each row by three. After averaging all the error-times-input products over the different training examples, each entry is just a general number that I can no longer really associate with x transpose or any particular value of x, because it's an average. So let's just call them g1, g2, g3, g4, g5. They give a general idea, per weight and over m training examples, of the ratio between a small tweak to that weight and the corresponding change in the cost function; that's what g1 is for weight one. It's actually quite beautiful what we get from this whole process: one long vector, as long as the number of weights going into our node, telling us how adjusting each weight affects the cost.
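To tie the whole weight derivation together, here's a minimal NumPy sketch of that gradient, under the same assumptions as before: one ReLU neuron, MSE cost, columns of X as training examples. The function name is mine, not the lecture's:

```python
import numpy as np

def grad_cost_wrt_w(w, b, X, y):
    """dC/dw for a single ReLU neuron with MSE cost.

    Each column of X is a training example; returns a vector the
    same length as w, i.e. the averaged e_i * x^(i) products.
    """
    m = X.shape[1]
    z = w @ X + b                    # pre-activation for each example
    e = np.where(z > 0, z - y, 0.0)  # error (w.T x + b - y), zero if inactive
    return (X @ e) / m               # row k is (1/m) * sum_i e_i * x_k^(i)
```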
I want to hit on one more thing after this, the beautiful idea of gradient descent, but first, similar to what we did with the weights, let's do the exact same thing and find the derivative of the cost with respect to the bias. It'll be quite similar. We use the same intermediate function v, saying y minus a^L equals v, and the same chain rule idea: to find the derivative of C with respect to b, we chain it, finding C with respect to v, then v with respect to a^L, and we already know how a^L changes with respect to b, so this won't be that hard. First we need to
find how v changes with respect to a^L. That's just going to be minus one, since v is y minus a^L. So we've got one piece: dv/da^L equals minus 1. Then we multiply that by how a^L changes when we change b, which we already know. So the change in v when we change a^L, times the change in a^L when we change b, is negative 1 times whatever our long and winding piecewise thing was. How a^L changes when we change b was either 0 or 1; let me just check, yes: it's 0 if w transpose x plus b is less than or equal to zero, and 1 if w transpose x plus b is greater than zero, because when you take the derivative of w transpose x plus b with respect to b, the w transpose x part has nothing to do with b, so it drops away and you're left with one. That's decently easy. If we multiply our minus one in, we get almost the same thing: minus one times zero is zero, and minus 1 times 1 is minus 1, so not much of a change, and we can simplify dv/db into this piecewise: 0 or minus 1. Now that we've covered these last two derivatives, we only need to find how C changes when we change v. So let's take the derivative of our
cost function with respect to b, writing out the cost over all the examples as 1 over 2m times the sum of v squared. First step: the derivative of a sum is the sum of the derivatives, so we can move the derivative inside the average. Then we use the chain rule to split v squared into two functions: to find the derivative of v squared with respect to b, we take the derivative of v squared with respect to v, and then the derivative of v with respect to b, the exact same philosophy as before. Can we do it here without me redrawing everything? Sure: 1 over 2m, sum from i equals 1 to m, and how v squared changes with respect to v is just the power rule, so that's 2v, multiplied by that derivative of v with respect to b. We can take the 2 out and put it over here, where it obviously cancels against the 2m, so it all becomes a pure average: 1 over m, times v, times this derivative of how v changes when we change b. Great, so now what? Well, how v changes when we change b we already computed over here, so we can just substitute it, similar to what we did before: zero or negative one. Now we put the v in: v times zero is zero, and v times negative one is negative v. And now we substitute for negative v, doing the similar thing to before, so we have
1 over m, sum from i equals 1 to m, and then the piecewise. Our zero case is uninteresting, but what's the non-zero case? I'm going to make some space to make this a little clearer. So: zero, and then negative v if w transpose x plus b is greater than zero. What is v? We assigned v to be y minus a^L, so substituting, the branch is minus (y minus a^L), holding if w transpose x plus b is greater than zero. Before we simplify, can we expand this further? Yes we can: a^L is just the activation, so we're subtracting max of 0 and w transpose x plus b. But exactly as we saw with the weights, the max of 0 is completely redundant, because the whole reason for this piecewise function is figuring out whether w transpose x plus b is greater or less than zero; if we've chosen this path on the piecewise function, then w transpose x plus b is obviously already bigger. So all we need to do is get rid of the max, and we have just w transpose x plus b. Now we distribute the negative: it becomes negative y plus all of that, and we can rearrange it to be nicer, putting w transpose x plus b first and then minus y. So our final thing: it's zero if w transpose x plus b is less than or equal to zero, and it's w transpose x plus b minus y, the error term, if w transpose x plus b is greater than zero. And we can obviously put the sigma inside, and this time the mean as well. I didn't put the mean inside the piecewise function on the weights version; I didn't have an exact reason, I just never did. You're allowed to put the mean in along with the sum, so let's do that here, just to get some practice looking at something like that. I'll erase and go back up: it's zero if w transpose x plus b is less than or equal to zero, and it's 1 over m times the sum up to m of (w transpose x plus b minus y) if w transpose x plus b is greater than zero. All right, that's our final answer.
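Written out, then, the bias result is (again with the per-example condition made explicit):

$$\frac{\partial C}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\begin{cases} 0 & \text{if } \mathbf{w}^{T}\mathbf{x}^{(i)} + b \le 0 \\ \mathbf{w}^{T}\mathbf{x}^{(i)} + b - y^{(i)} & \text{if } \mathbf{w}^{T}\mathbf{x}^{(i)} + b > 0 \end{cases}$$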
Now, a few differences between the weights and the bias. With the weights there was that x transpose term; now it's gone. Why? Because the derivative of a^L was different: before, the derivative of a^L with respect to the weights was x transpose, and that carried through to the final result. Here, the derivative of the activation with respect to b is just one, so nothing carries through, and we end up with a scalar. That makes sense, because in our simple neural network of just one neuron there's only one bias to tweak, so its gradient is a scalar. In a larger network with multiple layers you'll have another bias for each layer, but each is still grouped as a single scalar that you add to that layer, so those gradients stay scalars too. Substituting in our error term to get an idea: with one example, the derivative is error one times nothing else, so just the error; it's completely proportional to the error. With multiple examples we sum the errors: if we have three training examples, then the derivative of how our entire cost changes when we change our bias is the sum of the errors divided by three, the average. So that's something interesting to note: how our cost changes with respect to our bias is completely proportional to the error on that layer, a concept of error we'll explore more when we do backpropagation.
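As a companion to the earlier weight sketch, here's the bias gradient under the same assumptions (one ReLU neuron, MSE cost, columns of X as examples; the function name is mine):

```python
import numpy as np

def grad_cost_wrt_b(w, b, X, y):
    """dC/db: the average error over the examples where the ReLU
    is active; inactive examples (z <= 0) contribute zero."""
    m = X.shape[1]
    z = w @ X + b                    # pre-activation per example
    e = np.where(z > 0, z - y, 0.0)  # error term, zeroed where inactive
    return e.sum() / m               # a single scalar, as derived
```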
We'll see why backpropagation works, but before that I want to quickly talk about why we're computing all these derivatives. I hinted at it before, but now we'll go a little deeper into gradient descent and the mathematical and graphical understanding of why and how gradient descent works. I have quite an interesting intuition I'm looking forward to telling you about, so stick around for that. See you there. Now, if it's okay with you, let me go back to the single training example scenario, so we don't have to worry about the averages; the thing I'm about to tell you applies just as well when we do those averages, but it's a little easier to see an exact thing written out.
Let's try this again. Our gradient was a five-dimensional vector; to think about this graphically, let's take a smaller vector we can actually graph, something two-dimensional. So here's a network with two weights attached to it, which we can display in a three-dimensional graph. Let me make a little more space, because this is going to be pretty important. We're going to graph e_i x1 and e_i x2: how the cost changes with respect to the two weights, for a network that looks something like this, with some weight one, weight two, and an activation. If you're at all familiar with multivariable calculus or gradients, you've probably heard that the gradient of a function points in the direction of steepest ascent, and there's a really beautiful intuition for why that makes sense. I kind of stumbled upon this and just want to share it, because I think it makes quite a lot of intuitive sense. So imagine this graph: one axis is the value we give our first weight, another is the value we give our second weight, and the vertical axis is the cost. When we're trying to minimize the cost we're trying to get to the bottom of this surface, but let's not worry about that yet; the cost is represented by some sort of 3D shape, a curved-blanket kind of shape. There's a really nice intuitive way to look at this vector and why it points in the direction of steepest ascent, but first I want to get the notation right: this is the gradient, the gradient of the cost function with respect to the weights, and we just want to get used to that notation. Now we can start to get into why this points in the direction of steepest ascent and what that means for us
trying to find the lowest cost. We can see vectors in multiple different ways depending on what field of science or mathematics you're in, but something quite popular in physics, and a lot of the time in linear algebra, is viewing vectors as arrows through space. For example, in the normal cartesian plane, the direction of a vector can be described by its x coordinate and its y coordinate, just like a point, given that you travel from the origin to that place. We can do the exact same thing on the 2D plane created by our w1 and w2 on this graph: we draw an arrow whose component in the w1 direction is given by the first entry of our gradient vector, and whose component in the w2 direction is given by the second. There's just one more thing I want to mention, so you don't think I'm doing anything illegal (this isn't really about mathematical rigor; we're after a theoretical idea of why this makes sense, not exact calculations): as long as vectors have the same direction and the same length, you can move them anywhere in the plane. By default they're attached to the origin, but if you draw the exact same vector and just shift it somewhere else, that's completely fine, as long as it describes the same direction and length. So what happens when we think of the gradient as an arrow in this plane? Since it's two-dimensional, there's no way it can go up; that would require a third coordinate, so it just acts within the w1-w2 plane. Our idea is that this arrow points toward the direction where the cost ascends most steeply: say in this direction the surface goes up very steeply, its 2D slope, I guess, is very steep, so the arrow points that way. But why is that? Well, we can get quite a nice intuitive
idea just by looking at the gradient's entries. One of the most important observations we can get from deriving all of this is that how much tweaking some weight affects the overall cost is extremely dependent on the input associated with that weight. You see, the e's are all going to be the same; they only vary from training example to training example, they're the same within one training example, and we're only dealing with one training example, so e is some constant number. The thing that makes some w1 affect the cost function more, that increases the ratio between a tweak of the weight and the resulting change in cost, is the value of the input. The value of the input is what decides the derivative, how much changing a weight affects the cost. If an input is very large, even a small change in its weight will cause a very large change in the overall cost: you change it a little and get a huge change. Compare that to, say, x2 being something smaller, maybe below one; then any multiple of the error against it causes a much smaller change. You can almost think of changing w2 as less important than changing w1: you get more bang for your buck changing w1, because even small little changes to w1 cause large decreases in cost. This is really the heart of optimization with gradient descent, which I'll explain mathematically soon: we're trying to figure out where we get the most bang for our buck, which weights we can change most effectively with even small tweaks. Often when training a neural network, the weights with very small inputs will just be dead; those nodes won't really learn. They might move by some very small amount, but they don't really have a reason to, because the whole way we evaluate how to increase or decrease weights is the overall cost of the network, and if changing some weight barely changes the output cost, we're not going to spend computational power changing it when we can change another weight by the same amount and see something like a forty-times difference in the cost function. That's interesting in its own right, but it also connects to what I mean here: if you imagine the gradient as an arrow in space on this plane, when x1 is big, the error is magnified; even if the error is small, its effect on the cost can be quite large.
So when you create this vector, it points in a direction weighted by the components e x1 and e x2. If x1 is very large, the arrow points more in the direction of w1. Say x1 is 100 and x2 is 10, and to keep it easy the error is just 1. Then the w1 component is 100, so the arrow stretches a long way along the w1 axis, while the w2 component is only 10, so there's only a slight extent in that direction. That gives you an idea of why the gradient points where it does: it points toward the weights that have the most impact on the cost, because their error is magnified by a large input like x1. A weight whose input is small has a much smaller effect when you multiply the error by it, so the gradient doesn't point in that direction as much; pointing that way wouldn't point you toward the steepest part of the cost. Pointing toward the steepest part of the cost means pointing in the direction of these magnified errors. So that's the intuitive understanding.
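To make those hypothetical numbers concrete (e = 1, x1 = 100, x2 = 10 are just the example values from above):

```python
import numpy as np

e = 1.0                              # error on our single example
x = np.array([100.0, 10.0])          # inputs x1, x2
grad = e * x                         # gradient entries e*x1, e*x2
print(grad)                          # [100.  10.]
print(grad / np.linalg.norm(grad))   # unit direction: ~[0.995, 0.0995],
                                     # i.e. almost entirely along w1
```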
But you might be asking: why would we want to know the direction of highest cost, when we want to be heading in the direction of lowest cost? The thing is, we can use the direction of highest cost to find the direction of lower cost, and the answer is quite beautifully simple: just take the negative of the gradient. Instead of pointing in the direction of the steepest increase in cost, the negated gradient points in the direction of the steepest decrease in cost. So you can keep stepping in the direction of lower cost rather than higher cost, and that gives you the foundation for the formula of gradient descent. I'll go slightly more in depth on gradient descent in the next clip, but that's really the most important foundation: we can use this gradient as a pointer, on this graph, toward the direction of steepest descent. That's a magical idea, and we really use it whenever we use any sort of gradient descent. Every single optimization algorithm in machine learning uses this concept, whether it's something newer like Adam or something that uses momentum; they all take advantage of this property that the gradient points in the direction of steepest ascent. I hope you're as excited about this as I am; I think it's quite a beautiful idea. I'll delve into gradient
descent for a couple of minutes after this, so I'll see you there. Now that you've calculated all these gradients, you have this big long vector with respect to all your weights and your biases: the cost with respect to your first weight, the cost with respect to your second weight, the cost with respect to your biases, all the way down to some final weight or bias. It really doesn't matter what order you put them in. This gradient you could write as the gradient of the cost function with respect to w and b; there are multiple ways you can display it, and I'm going to go with this one. So that's your gradient, calculated by that entire process of jacobians and chain rules we covered, combined into one nice pretty vector; all the information is in here. Let's see how we can use this gradient, equipped with the graphical understanding that it points in the direction of steepest ascent of our cost function, the direction that gets C as high as possible as quickly as possible. Each entry of the gradient corresponds to some weight or some bias, so we want all our weights and biases in one vector too. If we want to quickly talk about how the code works in a neural network: when we're about to do gradient descent, we unroll. We take all those weight matrices and basically turn them into one super-combined long vector that's the exact same length as the gradient. So we have all our weights and biases in one vector, and our gradients in another, both the exact same length, so we can do element-wise operations on them. The gradient descent algorithm uses this to its advantage.
So, let's use a capital theta to represent all the weights and biases in our entire network. Gradient descent is an iterative algorithm; you probably know this, but I'm giving it in the context of the neural network and its mathematics. You don't just do it once and find the solution, unless it's an extremely easy problem; in fact you may never find the truly optimal solution, just the lowest error you can reach. What you do is keep running the algorithm over and over again until your cost gets lower and lower. Say we randomly assign our weights and biases at the beginning, before training anything, so our cost is going to be really high. Then we update: we take theta and update it by an amount, negative alpha (let me draw a better alpha than that) times our gradient. I'll keep a reminder up here for SGD, which is coming later. So this is our algorithm for gradient descent; it's deceptively simple, or maybe it's not deceptive,
maybe it's just simple. Either way, this is our algorithm for gradient descent, and we iterate it over and over again. What we're doing is taking all our weights and biases and updating them by subtracting some learning rate (that's what alpha is called) times our gradient. This makes sense: ignoring the learning rate, we're taking the negative gradient, which now points in the direction that decreases the cost as quickly as possible, and that's exactly what we're looking for. We're pushing all our weights and biases in the direction of this arrow, the one that points the steepest, quickest way toward lower cost. Our parameter vector, which might be pointing in some crazy direction, gets shifted closer toward the right direction and closer toward a lower cost. Element-wise, it looks like this: w1 minus the learning rate times the derivative of the cost with respect to w1, then w2 minus the learning rate times its derivative, then the biases, then all the rest of the weights. We subtract the corresponding partial derivative from each parameter, and this slowly nudges everything in the right direction, hopefully getting closer and closer to the solution. The learning rate plays a critical role in how quickly you make these steps (you can think of them as steps down the surface) and how big those edits to the weights and biases are.
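As a minimal sketch of one update, assuming theta and grad are the unrolled parameter and gradient vectors described above (both hypothetical placeholders):

```python
import numpy as np

def gradient_descent_step(theta, grad, alpha=0.01):
    """theta := theta - alpha * grad, applied element-wise.

    theta and grad are unrolled vectors of the same length:
    all weights and biases, and their partial derivatives.
    """
    return theta - alpha * grad
```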
Of course, the step might point in the right direction, but if you make the learning rate too big you might overstep and land in a place with worse cost. The gradient points in the direction where we can lose cost as quickly as possible, but these are just partial derivatives, and when you're talking about derivatives you're talking about a very small change. Even though the gradient says this direction is good, it's really saying the direction is good if we move an infinitesimally small amount that way; it's a very small measurement. Multiplying that very small number by a lot could overstep far past where the move was helpful, so you might end up at an actually worse cost. Furthermore, if you take a learning rate that's way too small, you'll probably find the answer, but it might just take way too long, because all your steps and edits will be very, very tiny. The learning rate is called a hyperparameter: there are more and more algorithms that can help find the best learning rate, and there are brute-force approaches where you run the entire program at different learning rates and close in on the one with the lowest cost, but often the machine learning engineer or researcher has to figure it out themselves, at least in part, a little
bit. So, in the classic sense, the way gradient descent is first taught, we do gradient descent over our entire training set of m examples; let's say m equals a hundred. What you do is calculate the average gradient over all one hundred: you compute the per-example gradients and average them into a single gradient, in that way we saw before with e1 x1 plus e2 x1 and so on, but for a hundred training samples. Then scale that up to a thousand training examples, and more and more and more, and it gets computationally expensive, because now you're basically running the algorithm and computing the partial derivatives with respect to every single training example, and that gets to be a lot when you're dealing with that much data.
Probably the leading tactic now, although people might not be explicitly using stochastic gradient descent (SGD, as I put over here) and might be using slightly more sophisticated algorithms, is the idea of batching your training set into multiple smaller batches. Say we have 1 million training examples, which is definitely plausible; many datasets are much bigger than that. Let's batch it into batches of, say, 120, why not. So we segment the training set into groups: 1 million divided by 120 gives us that many batches. I'm not sure there's specialized notation for it, and it really doesn't matter for our purposes: there's a first batch, then a second batch, each of 120 training examples. How many batches is that? (I actually asked Google: 1 million divided by 120 is approximately 8,333.) So we have about that many batches
have about that many batches of these 120 so what we're going to do is we're going to do stochastic gradient descent on this batch and then we're going to do stochastic
gradients then on this batch and then we're going to do stochastic gradients on the next batch and on the next batch and on the next batch and what we're going to do is we're going to take remember how we would have to calculate
the partial derivatives with respect to all 1 million of our training examples just to take one step of gradient descent because we would have to create this 1 million long e 1
x 1 plus e 2 x 1 plus all the way down to e 1 million x 1 and then we would take the average over 1 million for that row and then we would do
the ones with so if you have tons of weights and tons of training examples this gets way too this just totally gets completely unreasonable when you're getting to large sets of data
and larger amounts of parameters um so what you have to start doing is you have to kind of start so obviously this is the most perfect way to do gradient descent you can think like this is the most mathematically perfect and
rigorous ways that we kind of we kind of check every single training example and we take it and we kind of take every training example into account when making our decision to okay what direction should we go but but
mini batch or not sorry mini batch is another word for it but um stochastic gradient descent says this is good enough so what we're going to do is instead of taking taking 1 million training examples to
take one step we're going to take 120 training examples and take one step based on that so now we'll only have 120 training examples in our batch and then we'll do that and we'll do that and
we'll do that and usually it's nearly the same as just doing you know completely vanilla taking everything perfect kind of for
the perfectionist approach now we're kind of just saying okay let's hope that these 120 examples uh represent the entire uh training set quite well
and that's why obviously we need to shuffle a shuffle our data before doing anything on it because of course if this is ordered data where you have some a a a and we have right answers all here
then then this is going to be all some answers so we need to shuffle very well uh before and making sure that we don't have any preconceived kind of unknown biases in
our sorting of the training data so in this perfectly shuffled or as close to perfectly randomized shuffling as possible we're going to
cut this 1 million to 8 33 different batches and then run our normal gradient descent on these smaller training sets taking one step each time so now we can
actually take 8 33 steps um where before we could only take one because before we'd have to we would have to compute this entire 1 million and then take one step now we
can take one step after completing each batch so that means we'll be able to take uh 8 33 so we were able to take 8 332 more steps by
using stochastic gradient descent um which you know obviously this becomes unreasonable very quickly
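To make that concrete, here's a minimal sketch of the mini-batch loop in code. The function grad_fn is a hypothetical stand-in for whatever computes the averaged gradient of the cost over a batch; it isn't something defined in this lecture.

```python
import numpy as np

def minibatch_sgd(X, y, w, grad_fn, batch_size=120, lr=0.01, epochs=1):
    """Minimal mini-batch SGD sketch. `grad_fn(Xb, yb, w)` is an assumed
    helper returning the averaged gradient of the cost over one batch."""
    n = X.shape[0]
    for _ in range(epochs):
        # shuffle first, so no hidden ordering bias leaks into the batches
        perm = np.random.permutation(n)
        X, y = X[perm], y[perm]
        for start in range(0, n, batch_size):
            Xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            # one step per batch: with the lecture's numbers, ~8,333 steps per epoch
            w -= lr * grad_fn(Xb, yb, w)
    return w
```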
Just some information on the naming here, because it can be a bit confusing. There's a version of stochastic gradient descent, the one originally called stochastic gradient descent, that does this same thing but with batches of size one, so it takes a single training example into account at each step. Picture a contour plot of the cost with the lowest cost in the middle: if we're doing that true stochastic, one-example learning, gradient descent on one example at a time, one example is going to be much less representative of one million examples than even 120 will be, so our path down the cost surface will be much more jagged; each example pushes us one way or another, but in general, after many steps, we should still be able to get there. The version I described, for which I think the rigorous term is mini-batch gradient descent (which makes the meaning quite obvious), is usually what people mean when they talk about stochastic gradient descent: segmenting the set up into batches. If we use this version, we get a slightly more jagged line, but we get there fast; if we did the entire perfectionist version, we would get there along the prettiest line possible, but it would take an insane amount of time.

So that's a super quick rundown on gradient descent, hopefully already familiar to you, but it closes the picture of our simple neural network with five inputs and how we can find the optimal parameter and bias settings to lower the cost using stochastic gradient descent. Next up I'll try to explain how we can get from this x1, x2 single-node system to a system with many more nodes, and how we can do similar things with multiple nodes to what we did with one node. So I'll see you there.

All right, so now that we've finished talking about the derivatives of the weights and the biases in a small neural network, let's see how this works with slightly larger networks.
Let's take something where we have three inputs, which go to three nodes, then to four nodes, then to one node, which gets computed into some sort of cost. So we have a three-layer neural network, layers one, two, three, and three weights matrices: W1, W2 and W3. How would we calculate each of the derivatives of the cost with respect to the weights and the biases? (If we manage to do the weights, the biases come pretty easily.) The thing I want to show is that it's pretty hard to do this when we're thinking about going forward, in the way we were thinking about it before; it's quite cumbersome, and let's see why that is.

Theoretically, just using the simpler chain rule intuition, if we wanted to find how the entire cost changes when we change the first set of weights, we would start from somewhere near the input and work forward. Calling these the first, second and third layers of activations: we would see how the first layer of activations changes when we change the first weights, then how the second layer of activations changes when we change the first layer of activations, then how the third layer changes when we change the second layer, and then how the cost changes when we change the third layer; something like the chain written below.
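Written out as a chain (this is my formalization of the forward-style chain just described, using this lecture's layer indices and treating each factor loosely as a layer-to-layer derivative):

```latex
\frac{\partial C}{\partial W^{1}}
= \frac{\partial C}{\partial a^{3}}\,
  \frac{\partial a^{3}}{\partial a^{2}}\,
  \frac{\partial a^{2}}{\partial a^{1}}\,
  \frac{\partial a^{1}}{\partial W^{1}}
```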
Likewise, if we wanted to find the derivatives for the second set of weights, we'd have to change how we do this: we would start at how a2 changes when we change the second weights, then how the third layer of activations changes when we change the second layer, and then how the cost changes when we change this third layer. But there are a few problems with how cumbersome this back-and-forth is. You're going to be computing derivatives multiple times, because computationally these will still be different actual numbers being multiplied through to the cost each time. And even though you have fewer derivatives to deal with when you're computing the weights farther along the network, the actual numerical values of the derivative of C with respect to W2 depend on what a1 is, which depends on what W1 is; so you have to do some more chain rule magic to see how those numbers affect W2, which is also a pain. Remember from our w-transpose-x derivations that in any layer the weights are multiplied by the inputs, so we're forced to figure out where those inputs come from.
So computing the derivatives forward is going to be kind of a mess in general; it's possible, but it's not really done that often. Instead we use something called back propagation, probably the most famous algorithm in machine learning, where we go backwards from the cost through the network to calculate the derivatives with respect to the weights. The reason this works better is that once we've calculated everything forward, we get our final cost, and then we can index back one step at a time: see how this thing impacts our cost, then go just one step further and see how the previous thing impacts that, and so how it impacts our cost. By going backwards we save a lot of computation; we don't need to go back and forth and back and forth, because as we stroll through the network backwards we accumulate the effects: this changes that, so when we go one step further, that changes this, and so on. Through a single backwards pass through the network we can calculate all the derivatives we need, because we accumulate them in these error terms.

So we're going to introduce this concept of the error of one node: the error of some node j in some layer l, and how we can calculate, basically from this one node, how much error it is causing in the cost function; and how we can use that to help solve for the optimal values of our weights and our biases in a much larger network, switching pace from this forward idea to this backward, back propagation idea.
Next up we're going to talk about four critical equations. This idea of four critical equations comes from a man named Michael Nielsen; in his book he put it quite succinctly with these four fundamental equations of back propagation, which I'll go through, and then I'll extend on how they work with Jacobians and matrices near the end of this lecture. So let's jump into those four equations and help derive them. I won't show super tough numerical proofs for them, but I hope to give some intuition as to why they make sense, just by using the chain rule and filling in the gaps. So be excited for this last part, where we look at the laws of back propagation and truly understand how we back propagate through a network, how we decide exactly how each of the weights affects the cost, and how we can put those into this final gradient. We now know our motivation for back propagation: we're trying to find the gradient of the cost with respect to every weight and bias. So we know why; back propagation is how we get there. I'll see you in the next part, when we start talking about the rules of back propagation.

Now that we're talking about back propagation, it's really important to mention this concept of the error of a single node.
So let's take a look at some neural network; say we have three inputs, all going into some nodes. The error of a node: choose some node, say this one in the a1 layer, and let's build the graph for it. This node, like all nodes, can be represented as w-transpose x plus b; let's say it's a-1-2, the second node in layer one. The question is: if we were to change the value of this node by a little bit, how would the overall cost change?

There's just one more specificity here. For mathematical reasons, we're not actually talking about adding a small quantity to the value of the activation a itself (you'd get a similar answer using that method); what we're actually doing is adding a small change to the z of this node, the value of the node before it goes through the activation function, so z-1-2. So: if we add some small quantity to the z-1-2 of this node, how will the overall cost function change? That's the idea of the error of a node.

Mathematically, we can describe the error of a node with a symbol indexed by l and j: for the j-th neuron in layer l, the error is the change in the cost function when we change z-l-j. So in our situation, where l is 1 and j is 2, the error of this node equals how the cost changes when we change the z-1-2 of this node. That's just an important thing to keep in mind; written out, the definition is below.
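This is Nielsen's notation for the error term, with our node as the example:

```latex
\delta^{l}_{j} \;\equiv\; \frac{\partial C}{\partial z^{l}_{j}},
\qquad\text{so here}\qquad
\delta^{1}_{2} = \frac{\partial C}{\partial z^{1}_{2}}
```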
And I guess one last question is why exactly this makes sense and why we call this the error of the node. Just think about it like this: if this derivative, the cost with respect to z-l-j, is close to zero, that means changing the value of z has little effect on the final cost. If, alternatively, it has quite a large value, changing the value of this node creates a large change in the cost. Going back to the bang-for-your-buck idea: if you wanted to decrease the cost, you would take a step against this large derivative, and that would actually reduce the cost by quite a bit. And if we know that altering this node by a little bit can decrease the cost quite a lot, we can say that this node is actually causing a lot of error: it magnifies anything that multiplies into it, because its derivative value changes the cost function so much. It's similar to how we were looking at gradients and how they point in the direction of steepest ascent; now we have these nodes with large derivatives magnifying any error that we have.

So we can really call this the error of a node; a more specific name would be the impact of a node on the cost, or something like that, but we can shorten it to the error of the node, because from this total error C we can see how much of the error is being caused by each of the nodes. You could almost imagine a pie chart of C, showing how much of the cost is being caused by each node. So that's the idea of the error of a node, and it will be extremely important in all four of our equations regarding back propagation and how we compute the derivatives with respect to the weights, because obviously the weights and the nodes are intimately connected; this idea of the error of the node can help us derive the weights and the biases. So let's go right into the first rule and see if we can figure out its intuition. The first rule is actually quite simple, so we should have a nice easy start; let's get to it.
The reason we're outlining these four central equations of back propagation is that they basically make up the whole of back propagation: once we know these four equations, the steps of back propagation become quite clear.

So, using this intuition of a node's error, where the error of node j in layer l equals how the cost changes when we change this node's z value, z-l-j, let's see if we can find the error of a node that's in the last layer of the network, so it's getting directly compared with our real answer y. For example, in a neural network where the final activation here is a-L, and a-L is actually just equal to the sigmoid of z-L, the cost is some version of the real answer minus a-L, maybe squared or something (there are obviously different variations on it). What we're trying to find is: how does that cost change when we change the z of the last layer? There might be multiple nodes in that layer, but for any node in the last layer, how does the cost change when you change its z?

So let's figure this out. We can intuit that we're looking for a capital L here, capital L being the last layer of our network, and some j. We're still going to have to use a very small chain rule step, because the whole definition we created of the error of a node deals with the cost with respect to some z value, but when you're at the end of a network, you're comparing the real answer with your a value, not your z value; the a value is basically just the z going through a sigmoid function. So since we're computing a with-respect-to-z term, we have to compute a with-respect-to-a term first: we see how our cost changes when we change a-L-j, some activation node in the final layer, multiplied by how this activation node in the final layer changes with z.

Simply enough, the second factor is just the derivative of the activation function; I won't explicitly write the derivative out, it could be ReLU, it could be sigmoid, but it'll be something, and it'll be the derivative of that function with respect to z-L-j. And then we multiply this by the derivative of whatever cost function you're choosing, maybe MSE or something like that, where you're comparing the real answer against the predicted answer and squaring it, taken with respect to our output a-L. So the first factor is the derivative of our cost function with respect to a-L, our final activation output; I'm just going to keep it written like that, since there's really no need to compute it here (we already computed that with our smaller neural network). And whatever this product ends up being is our error of L, j.
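In symbols, with sigma-prime standing for the derivative of whichever activation the last layer uses:

```latex
\delta^{L}_{j}
= \frac{\partial C}{\partial a^{L}_{j}}\,
  \frac{\partial a^{L}_{j}}{\partial z^{L}_{j}}
= \frac{\partial C}{\partial a^{L}_{j}}\,
  \sigma'\!\left(z^{L}_{j}\right)
```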
As you notice, this is actually a component-wise equation for the error of the last layer, because we're looking at a specific node j in the ending layer. But for back propagation we really want a more matrix-based notation; we want to be dealing with the entire layer. So let's do exactly that: let's find the error of every node at once, so now we're looking for a vector of errors for the layer, where each entry is going to be the error of each node in that layer.

The thing is, if we take the derivative of this cost with respect to all of the nodes in the last layer a-L, we're taking the partial derivatives of the cost with respect to all the different a-L values; what we're actually getting is the gradient of the cost with respect to a-L. Before, we were looking for the gradient of the cost with respect to w and b (not really back propagation yet, but deriving the cost with respect to the weights and the biases); now we're just doing it with respect to all the values in a-L. So we can take this derivative and display it notationally as a gradient, because that's exactly what it is. And then we multiply by the same activation factor as before, except now applied to the whole layer: you can think of it as element-wise finding the derivative of whatever our activation function is and plugging in each element from our z vector, which is again just a vector of all the z values of the nodes in our final layer; so the derivative of the activation applied to z1, then to z2, and you get the pattern. It's all element-wise operations here: an element-wise multiplication between each item in the gradient and each item in whatever that activation-derivative vector ends up being, and the result is our vector of errors that we're going to use.
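To make this first equation concrete, here's a minimal numerical sketch; the sigmoid activation and the squared-error cost are my assumed choices, since the lecture leaves both generic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Equation 1: delta^L = grad_a C  (element-wise *)  sigma'(z^L)
def output_error(z_L, y):
    """Error vector of the last layer, assuming C = 0.5 * ||a^L - y||^2."""
    a_L = sigmoid(z_L)
    grad_a = a_L - y                     # gradient of the cost w.r.t. a^L
    return grad_a * sigmoid_prime(z_L)   # element-wise product
```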
So that's the first equation in matrix form. We're going to show the next three equations all in this matrix form as well: defining the error of some layer, and, later on in the last two, the weight derivatives and the bias derivatives of some layer, using these errors of layers. There's not much else to say about this one, so let's dive into the second equation, which gives the error of any node; I'll explain the significance of that in the next clip.
So now that we know how to find the error of the nodes in the final layer of our network, how can we use this to find the error of any node in our network? Well, this formula tells us how. If we have some layers of a neural network, and we know the errors of the nodes in layer l plus 1, we can use these to derive the errors in layer l; that's basically what it's telling us. Using this formula we can back propagate through a network and cumulatively find the errors with respect to each of the nodes: if we know this layer, we know the one before it, and the one before that, and this formula tells us how.

It says: take the weights matrix of layer l plus 1, the weight matrix that connects layers l and l plus 1. (Remember, the weights matrix convention is that its superscript is the layer it's going into; if it's going into l plus 1, that means it's coming from l.) If we transpose that weights matrix, so we flip it, and multiply it by the vector of errors of layer l plus 1 (say, the three errors associated with three nodes there), and then multiply that element-wise by the derivative of the activation applied to the z's in this layer (the vector of z values, the ones that haven't been touched by the activations yet), then we get the errors. What comes out is a vector of errors with the same number of entries as the number of nodes in this layer, and each entry in this error vector is the error associated with each node.
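In symbols, writing the circle symbol for the element-wise product:

```latex
\delta^{l}
= \left(\left(W^{l+1}\right)^{T} \delta^{l+1}\right)
  \odot \sigma'\!\left(z^{l}\right)
```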
Let's explore two reasons why this works out. One is the intuition I've seen used before; it's not super satisfying to me, in the sense that it doesn't really fill in all the gaps. The other is something I haven't seen shown before that really made sense to me when I first figured it out.

Let's start with the first intuition. When we're feeding forward through a network, in each layer we're multiplying the weights by the inputs to the nodes and going forward. So if we know the errors of layer l plus 1, then when we multiply them by the transpose of the weights matrix that connects these two layers, we can almost think of it as going in reverse: back propagating the error from these nodes to the layer before them. If you look at the dimensions, it makes sense; and here are some hints on the intuition from how the weights matrix is indexed. Remember our j, k, l indexing: the columns are the k, where the weight came from, and the rows are the j, where it's going into. If we flip this with a transpose, the roles of j and k swap, so now the columns are where it's going into. So when you multiply by the transpose, you're almost reversing the process of feeding forward: now you're back propagating these errors through your network to the layer before them. And that's basically, intuitively, what we're trying to do: find the errors of this layer by accumulating the errors of the next layer and seeing how these ones affect those ones. Then we multiply by the activation derivative of z-l because, remember, the error is defined as how the cost changes with respect to z; thinking chain-rule-wise, we can't go straight from a-l, so we have to multiply by that a-over-z factor.
But something that makes a lot more sense to me is this chain rule intuition, where you can fill in the gaps of the chain rule in a way that makes sense. Let's start with an example and see if, by trying to find the error in our example, we get something that completely matches this formula.

I'm going to use the same notation I talked about at the beginning of this lecture, where we represent the entire layers of a neural network in this horizontal way. Say we have some weights matrix W1 (I'm going to use actual numbers here, not variables) and some vector a0 of inputs (that could be called x, but I'm going to say a0, since it's the first layer in the entire neural network), plus the biases, which is also a vector. So that's a matrix, that's a vector, and that's a vector, and together they represent an entire layer. Then we go to the second layer: we use a new weights matrix, multiply by the outputs of this layer of nodes, a1, and add a new bias vector. So here we have two layers of network represented this way.

How can we find the error of the nodes in layer one? That's going to be our goal: to find the errors of all the nodes in this layer, keeping in mind that it's the z we're looking for; this is z1, and the second layer's is z2, and we're trying to find the change in the cost with respect to z1. Using that definition of the error, how can we derive from the chain rule what's missing? (You'll see what I mean by that.) Say we're given the error of the layer in front of it, because we're back propagating: say we somehow already calculated the errors of that layer, so we have the change in C with respect to z2. How can we get from this to what we want? When you're doing the chain rule, you have some f and how it changes with respect to x, but you can split that up into how f changes with respect to u and how u changes with respect to x; you can think about cancelling those u's out and getting how f changes with respect to x. Now we can do the same thing.
So we're seeing that we have this term with the cost in the numerator; how can we get z1 somewhere in a denominator, so that everything can cross out and we end up with the change in C with respect to z1? That's what we're doing: some detective work on how to fill in these blanks. How can we connect how the cost changes when we change z2 to how the cost changes when we change z1? We want to find links in the chain whose numerators and denominators cancel, and the most obvious link is how z2 changes when we change a1, because a1 connects us back towards z1. So what we do is this: we take how the cost changes with respect to z2 (that's the error we're given, equivalent to the error term of layer 2), multiply by how z2 changes with respect to a1, so we're chaining our way back, and then multiply by how a1 changes with respect to z1, and now we're home. You see that we can cancel these out and we're left with the change in C with respect to the change in z1; but let's not cancel it out just yet. Let me just write this here: this product is the error of layer 1.
So what are these factors, and what happens when you multiply them out? Well, let me make some space; everything here is kind of important, so it's tough. I'm going to erase these and trust that you know these are the error terms; I'll rewrite them right above, so that's error one and error two. So what are these terms? We know this first one is the error term of layer 2 that we're given, and we're always multiplying these together as a chain rule operation.

Now, how does z2 change in terms of a1? This is the only place where you need some mathematical knowledge that I didn't really supply in this course. How does this entire interior expression, W2 times a1 plus b2, change with respect to a1? Well, b2 is out of the picture, because it's added and has nothing to do with a1, so that contribution becomes 0. The thing connected to a1 is W2, which is a matrix, and this is a matrix-vector product: we're multiplying a matrix by a vector. I don't really have time to prove this, but when you're taking a matrix-vector product, with some matrix A and some vector b, the derivative of the product Ab with respect to the vector b is A transpose (under the convention we're using here, where these derivative factors are arranged to multiply onto column vectors of errors). I honestly don't have much good intuition as to why this works; there's honestly not even that much information online, but there are a few proofs out there if you look for them, and I found a few good articles that I'll try to link in the description as to why this makes sense. But let's just take it for granted, which I do rarely, but we're really focused on neural networks here: the derivative of a matrix-vector product with respect to the vector is the matrix transposed. So since this is a matrix-vector product, and we're taking the derivative with respect to the vector a1, our derivative is going to be W2 transpose.
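Here's a quick numerical sanity check of that fact in the column-gradient convention: pushing a gradient back through a linear map z = W a multiplies it by W transpose. The cost C(a) = v dot (W a) is just a made-up scalar function for the check:

```python
import numpy as np

# Check: for z = W @ a, the gradient of a scalar C w.r.t. a is W.T @ (dC/dz).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weights into a 4-node layer from a 3-node layer
a = rng.normal(size=3)
v = rng.normal(size=4)        # stand-in for the upstream gradient dC/dz

grad_a = W.T @ v              # the chain rule step with the transpose

# Finite-difference check on C(a) = v . (W @ a)
eps = 1e-6
fd = np.array([
    (v @ (W @ (a + eps * e)) - v @ (W @ a)) / eps
    for e in np.eye(3)
])
print(np.allclose(grad_a, fd, atol=1e-4))  # True
```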
And then this a1-over-z1 factor is pretty easy: it's just how this entire function a1 changes when we change z1, which is obviously to do with the activation function. How a1 changes with respect to z1 is just the derivative of our activation function, whatever that is, with our z1 inside.

So that's what we're doing: we're multiplying these three elements. Let's make some space (I think we don't need the graph anymore; it's pretty messy anyway), and I'm just going to write the original thing here. We're trying to find the change in the cost when we change z1, which is the error of layer one, and it's equivalent to the chain we made: the change in C when we change z2, the change in z2 when we change a1, and the change in a1 when we change z1. Well, now we've found that these are entirely equivalent to a multiplication of three things: the error of layer two, obviously; the W2 transpose matrix; and the derivative of the activations with the z1 in there. And we multiply those three. But where have we seen that before?

Let's check what this means. The activation part is an element-wise product, so it produces a vector, and it doesn't matter whether we apply the element-wise part before or after; but let's do the matrix part first, so we have our weights matrix transposed times our errors. So where have we seen this before? We're trying to figure out the error of layer one, so our l equals 1: we're multiplying the weights matrix of layer 2, which is l plus 1, transposed, by the errors in the next layer, which we also have; then we do an element-wise product with the activation derivative of z-l, which is z1. So we have found that, just by trying by hand to use the chain rule to find the errors of our layer with respect to a layer that comes later, we automatically recover this formula. It's quite natural, actually, why this comes to be, and I think it's quite nice; I'm not really sure it counts as a concrete proof, but it's a nice idea.
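Putting equations one and two together, here's a minimal sketch of the backwards error pass. Sigmoid is again my assumed activation, the list indexing is my own convention, and delta_L would come from the equation-one snippet above:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Equation 2: delta^l = (W^{l+1}.T @ delta^{l+1}) * sigma'(z^l)
def backward_errors(weights, zs, delta_L):
    """weights[i] is W^{i+1}; zs[i] is z^{i+1}; delta_L is the last layer's
    error from equation 1. Returns [delta^1, ..., delta^L], accumulated
    in a single backwards pass through the network."""
    deltas = [delta_L]
    for W_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        deltas.insert(0, (W_next.T @ deltas[0]) * sigmoid_prime(z))
    return deltas
```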
So that's it for our second equation; we'll keep the equation up there. That's an intuitive way to see how we can back propagate through a network and calculate all these errors cumulatively: if we know the errors of this layer, we know the errors of the ones before, and then the one before that, and we can go all the way to the beginning of our network. This is starting to give you an idea of why we do back propagation in the first place, and how easy it is. Obviously this isn't calculating the weights and the biases just yet, but you'll see very soon that the derivatives of the weights and the biases are just a small step away from the derivatives of the nodes themselves, the errors. We'll explore that in the next two equations; we're approaching the end of this lecture, but we still have some important things to say, so stick around for that. All right, see you there.

All right, so now we're on to the third equation, and now we're actually on to the equations where we're solving for the derivatives of the cost with respect to any weight and bias. This is big: these are the things that we're going to actually put into our gradient descent algorithm to help improve our neural network. We see the light at the end of the tunnel now; this is what we're looking for.
In the next equation, equation four, we'll see how we can find the derivative of the cost with respect to any weight; but here we're doing the biases, and luckily it's pretty easy. Consider the vector of biases for a single layer in a network: if we have four neurons in the layer, this vector of biases will be four entries long, one for each node, and likewise the error vector for that layer will be four entries long, one error per node. What this formula tells us is that the vector of derivatives of the cost with respect to the biases of a layer is exactly equal to the error vector of that layer. You might think: wait, aren't the nodes going to end up with different errors from each other? Sure, but that's fine: the equation matches each bias to its own node's error, entry by entry, so the derivative for the bias of node j is exactly the error of node j. It's easier to see from the example than to quantify in words, so let's see why this is.
I'm going to use a similar intuition as for the previous equation, justifying it with this chain rule investigative work, filling in the gaps of the chain rule; but this one's going to be considerably easier. Let's do the same thing and take an example, and see if we can find this formula. Take two layers of our network again, shown in this style: we have our W1 times our a0 plus our b1, we put that through an activation, and now we have another layer, a matrix of weights times the output of that, plus a new bias. All right, so now we have our two layers.

Say we want to find the derivative for the bias of layer one: the bias vector b1 associated with a1, the first layer of neurons. How do we go about doing this? Well, using the two equations before, we were able to find the errors of every layer: remember, equation one figured out basically the first error right after the cost, and then we could use equation two to solve for the error of every other layer. So assume we have the errors for every layer; that means, if we're trying to find the bias associated with a1, we already have the error associated with a1, whatever that vector is. And this vector, as I've drilled into your head by now, is exactly equal to how the cost changes with respect to z1, the terms inside the activation here.

So we can just do some chain rule work. If we're given this error term, the cost with respect to z1, and we want to find how the cost changes with respect to b1, what do we multiply it by? It's not too hard; the question we want to be asking is exactly: how does b1 affect z1? Multiply the error by how z1 changes with respect to b1, cross the z1's out, and we get how the cost changes with respect to b1; and that works out quite nicely, and that's really it. That's our very short chain: all we need is this one term, how z1 changes with b1. And notice something; let's look at it closely. We have our W1 a0 plus b1, and this entire thing is z1, everything except the activation. So when we take this with respect to b1 (which is what we're trying to find; we're after these two terms, z1 with respect to b1 and C with respect to z1, and we already have the latter, the error term of layer one), the W1 a0 part cancels out, it just becomes zero because there's no b1 in there, and we're left with the derivative of b1 with respect to b1, which is obviously 1. So we get 1 times the error, which is just the error, and that's exactly what the formula says: z1 with respect to b1 is one, so when we do our chain rule, that factor becomes one, we multiply it by our error, and we just get returned the error itself. So that's, more or less, the proof of why this works.
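Written out for a general layer l, the whole argument is:

```latex
z^{l} = W^{l} a^{l-1} + b^{l}
\;\Rightarrow\;
\frac{\partial z^{l}_{j}}{\partial b^{l}_{j}} = 1
\;\Rightarrow\;
\frac{\partial C}{\partial b^{l}_{j}}
= \frac{\partial C}{\partial z^{l}_{j}}\,
  \frac{\partial z^{l}_{j}}{\partial b^{l}_{j}}
= \delta^{l}_{j}
```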
Hopefully you followed along with that; I think it makes quite a lot of sense. I'm not sure if these chain rule intuitions are working for you; they work for me, but it depends from person to person, and there are a lot of different intuitions out there. If you want a more solid mathematical proof, I'd definitely recommend checking out Michael Nielsen's book on all of these; it's a free online book that he put on the Creative Commons, and I definitely recommend you donate to him, because he's putting out a lot of good stuff. I'll put a link to his book in the description if you want a few more mathematical proofs of the first two equations, explained a bit more; some of what's here might not be in there, though. So that's the intuition for equation three; next we'll hop into finding the intuition for the last equation, equation four. I'll see you there.
Hi, all right, let's take a little time to talk about equation four. (The reason I'm wearing different clothes is that this footage got deleted, so I'm having to refilm.) Anyway, equation four is probably the most important equation: it finds the derivative of the cost with respect to any weight that we want. Looking at this equation, one thing you might notice is that this isn't a vector equation; what we're getting is a scalar. We're inputting the scalar error of node j in layer l, multiplying it by the k-th activation node in layer l minus 1, and that gives us the derivative of the cost with respect to the weight indexed j, k; pay attention to those indices.
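Here's the scalar form written out (this is the fourth of Nielsen's fundamental equations):

```latex
\frac{\partial C}{\partial w^{l}_{jk}}
= a^{l-1}_{k}\,\delta^{l}_{j}
```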
First of all, I just want to say that after this we're going to try to vectorize this formula into something dealing with vectors, just like the other three equations we looked at; but we first need to understand the scalar formula and why it works, and we'll build the vector formula in the next part. So let's see why this formula works, with an example network: two layers, and I won't draw in all the connections, because I don't want to clutter it, since we're going to need this diagram. Let's say this is layer l minus 1 and this is layer l, and we're trying to find how changing one single weight will affect the cost further down the road. Let's call it just w, no superscripts, no subscripts; we're trying to find how the cost changes when we change this one specific w. Now, how do we index weights? We want to find some w-j-k-l: this formula gives us the derivative of the cost when changing a single weight, so it's a scalar derivative; it tells us how the cost changes when we change a single weight in layer l connecting the k-th node in the previous layer to the j-th node in this layer. So we'll keep it generic and say this weight w connects node k to node j.
Great, so this is w-j-k in layer l, and we're trying to find how it changes the cost. We can think about this slightly differently, in terms of what we already have: we have this error term, how the cost changes when we change the z of node j in layer l. So we have how the cost changes when we change z-l-j. If we're thinking about the chain rule, and we're trying to find how the cost changes when we change w-j-k-l, we already have how the cost gets tied in further down the road, in the form of our error term for l, j; that's what this is, and luckily we have it, using the first and second formulas. But we still want to find how z-l-j changes when we change this weight, w-j-k-l. If we chain those together, the z's cancel out, and we're able to get our formula.
So let's look at the z of this node and see how it changes with respect to this specific weight; before, we were looking at layers, but now we're looking at single neurons. How does the single neuron interact with this weight here? The z we're talking about is the z of this neuron, and we don't have to worry about the activation function, since we just have the z. Let's call the incoming activations a, b, c, and then k for ours. We learned that a neuron takes the dot product of all the weights going into it with the outputs of the layer before; so say the other weights are w1, w2, w3, and we're trying to find our plain w with no subscripts. Basically what the neuron does is: a times w1, plus b times w2, plus c times w3, plus our special one, k times our w, and then it adds a bias term. That's really what's happening in the dot product of all the weights with the inputs.

So when we take the derivative of z, which is really z-j-l, with respect to our specific weight here, we're really taking the derivative of this entire sum with respect to this one weight. All the other terms cross out, because our w isn't in them in any way, so they're just zero plus zero, and the bias drops out for the same reason. All we're left with is k times w, and obviously, when you take the derivative of k times w with respect to w, the w becomes 1 and we're left with k. But what is k? k is the activation in layer l minus 1, node k. So what we're doing is multiplying that, a-l-minus-1-k (because that's what this derivative is), by whatever our error in l, j is. That gives you the intuition as to why this works: we're just breaking down a single neuron and seeing how its operations basically force this equation on us.
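Writing that single-neuron computation out in the lecture's index notation:

```latex
z^{l}_{j} = \sum_{k} w^{l}_{jk}\, a^{l-1}_{k} + b^{l}_{j}
\;\Rightarrow\;
\frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}} = a^{l-1}_{k}
\;\Rightarrow\;
\frac{\partial C}{\partial w^{l}_{jk}}
= \frac{\partial C}{\partial z^{l}_{j}}\,
  \frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}}
= \delta^{l}_{j}\, a^{l-1}_{k}
```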
So that gives us the equation for a single scalar neuron; but how can that help us think about entire vectors? How can we find a matrix, the same size as the weights matrix, that gives us the derivative of the cost with respect to each weight in the weights matrix? That's what we're going to look at after this; hopefully you've got the scalar version down, and we'll move on to vectorizing this formula. See you there.

So, just one more thing about equation four, which you might have noticed when we were looking at it next to the other three equations, and it's pretty big. Remember, our first equation found the vector of errors in the last layer; we were looking for a vector whose length is the number of nodes in the final layer. Then we found a formula that tells us the vector of errors in any layer, and then a formula for the derivative of the cost with respect to the biases of any layer. Notice how those three are all vectors, and to find them we did vector and matrix operations; we were dealing in the language of vectors and matrices. But here, in our final formula, we use scalars. To find the derivative for a specific weight, we take the k-th activation in layer l minus 1 and the error of node j in layer l, both scalars, and what we get back is an exact scalar derivative: a single number, the derivative of the entire cost with respect to a single weight going from the k-th neuron in layer l minus 1 to the j-th neuron in layer l. So what's the deal?
Well, through some work we can find a formula that's equivalent to this but vectorized, so we're dealing with vectors and looking for the derivative of the cost with respect to all the weights in a layer. It's a little different with weights, though. Remember how we represented weights with the weights matrix: some W-l describes a matrix of all the weights of a layer. Let's set up a running example that can help us find this formula. Say we have three nodes here and three nodes here, with all the connections in between them: the total number of connections, the number of weights, is going to be nine, three times three, and the shape of our weights matrix is also going to be 3 by 3.

More importantly, as we saw (calling this layer l minus 1 and this layer l), the columns of our weights matrix index the node each weight is coming from, and the rows index the node it's going into. So the entry at the j-th row and k-th column is the weight coming from the k-th neuron in layer l minus 1 and going to the j-th neuron in layer l. Just to keep track, I'll label these: the rows are our j's and the columns are our k's. So this is the weights matrix, and our matrix of derivatives of the cost with respect to the weights should be something with the exact same dimensions. Right now it's weight 1,1, weight 1,2, weight 1,3, dot dot dot; let's fill it in for completeness, covering all nine weights: weight 2,1, weight 2,2, weight 2,3, then weight 3,1, weight 3,2, weight 3,3. Just by looking at the index of one of these, say w12, we know it's going from the second node in this layer to the first node here, so it'd be this connection, and it sits in the (1,2) position of the matrix. Nice.
So what we want is something like this: some derivative of C with respect to W-l, which is going to be a matrix of derivatives the same size as W-l, whose entries are the change in the cost when you change weight 1,1, the change in the cost when you change weight 1,2, the change when you change weight 1,3, and so on; you get the point. Basically, a matrix of derivatives that tells us how each of these weights changes the overall cost. That will finish our picture of vectorizing all four of these equations, and in our final part we can look at how we can actually do the back propagation algorithm with vectors, because when we're actually going backwards through a neural network we're not dealing with single numbers, we're dealing with vectors, since it's a lot more efficient. So how can we find this big matrix (it's not exactly a Jacobian, but it's a matrix of derivatives of C with respect to some W-l), and how do we come up with a formula that calculates all of these at the same time, all the weights of a layer in one step?
Well, it shouldn't be that hard. Let's first see what our matrix should look like, and then we can try to find some vector product or something that gets us there. We're using this example here, and let's put some concrete numbers on it, because it's getting a little confusing: this is layer one and this is layer two; that should cover it. We're trying to find how the cost changes when we change these weights, which is the weights matrix signified by W2 (yes, because the weights matrix label is the layer it's going into). So we want a matrix of how the cost changes when you change all the weights in W2.

Using the scalar rule, let's hand-compute, for each of the nine weights, what these entries are going to be, and put them in this matrix. Whatever each one is, it's going to be a product of two scalars, and the activation layer, l minus 1, is always going to be layer one here, so we can always fill that in: in all of these formulas we're going to be multiplying by the activations of layer one, some a1 entry (the neuron number within this layer might change, but it will always be an activation of layer one). It doesn't really matter which one you start with, so let's start with weight 1,1. To find its derivative, we just plug into the formula j equals 1 and k equals 1, the weight going from the first node here to the first node there. And since we're looking for W2, the l in the formula is 2 for all of these entries, and l minus 1 is 1. So in our specific case, with j equals 1 and k equals 1, we multiply a-1-1 by the error of layer 2, node 1.
Okay, that's our first weight done, so now let's speed through the others. In the next one, j still equals 1, but k equals 2; that means it's coming from the second neuron and going to the first one. So now we multiply a-1-2 (our k is 2) by the same error of layer 2, node 1, because our j is still 1 and our l is still 2. And now we can start to see a pattern and speed up. Along this first row, we're always multiplying by the exact same error term, and that makes sense, because this error term is associated with this first neuron in layer two, and everything in this row is a weight going into that neuron. But the activations are going to change: first the activation of this one, a-1-1, then a-1-2, then a-1-3. So we can fill them in. In the next row of the weights matrix we're going into the second node, and we cycle over the activations again, a-1-1, a-1-2, a-1-3; but now all our errors are with respect to that node, so they're all the error of layer 2, node 2, all the same because they all go into this neuron. And here in the last row we can predict the same thing: we cycle through a-1-1, a-1-2, a-1-3, and the error is now that of layer 2, node 3, three times over.
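Collecting the nine entries we just filled in:

```latex
\frac{\partial C}{\partial W^{2}}
=
\begin{pmatrix}
\delta^{2}_{1}\,a^{1}_{1} & \delta^{2}_{1}\,a^{1}_{2} & \delta^{2}_{1}\,a^{1}_{3}\\[2pt]
\delta^{2}_{2}\,a^{1}_{1} & \delta^{2}_{2}\,a^{1}_{2} & \delta^{2}_{2}\,a^{1}_{3}\\[2pt]
\delta^{2}_{3}\,a^{1}_{1} & \delta^{2}_{3}\,a^{1}_{2} & \delta^{2}_{3}\,a^{1}_{3}
\end{pmatrix}
```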
So we have this matrix of partial derivatives, and it tells us how each of these weights affects the total cost, provided we're given a vector of all the errors of these nodes and the activations of the previous layer. The activations are computed during forward propagation, so we already have those; and using equations 1 and 2 we can get the errors. So assuming we've calculated all our errors, we have everything we need to compute this matrix. Now, how can we compute it as a single vector product or something like that; how can we make this matrix in one step? Well, we can think of our errors of layer two as a vector that we calculate using our second equation: the error of layer 2 node 1, then node 2, then node 3, the errors of these three nodes. And then we have another vector, a1, all the activations in layer one: a-1-1, a-1-2, a-1-3.
can assemble it into a formula that gives us the matrix automatically. Toying around: when we think about multiplying vectors to create a matrix, doing $a^T b$ gives us a dot product if $a$ and $b$ are the same length. Let's use the actual lengths here: the error vector has dimension $3 \times 1$, and the activation vector is also $3 \times 1$. If we transpose the first one it becomes $1 \times 3$, and when we multiply two matrices, the inner dimensions have to match and the result has the outer dimensions, so $(1 \times 3)(3 \times 1)$ gives a $1 \times 1$. A dot product like that gives a single number, but we don't want a single number, we want nine numbers. So instead we transpose them in the other order, $a\, b^T$: a $3 \times 1$ times a $1 \times 3$, the inner dimensions cancel, and we end up with a $3 \times 3$ matrix. So we have two candidates for what it can be: the vector $\delta^2$ times $(a^1)^T$, or $a^1$ times $(\delta^2)^T$, just the other way around, both giving a $3 \times 3$ matrix. Let's see which one of these works.
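Here's a quick sketch of the two shapes in NumPy (the values are again made up):

```python
import numpy as np

delta2 = np.array([[0.2], [-0.4], [0.3]])  # errors, shape (3, 1)
a1 = np.array([[0.5], [0.1], [0.9]])       # activations, shape (3, 1)

inner = a1.T @ delta2   # (1, 3) @ (3, 1) -> (1, 1): a single dot product
outer = delta2 @ a1.T   # (3, 1) @ (1, 3) -> (3, 3): one entry per weight

print(inner.shape, outer.shape)  # (1, 1) (3, 3)
```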
Let's test it out. Say we don't transpose our errors, so that column stays the same, $(\delta^2_1, \delta^2_2, \delta^2_3)^T$, and we transpose our activations into the row $(a^1_1, a^1_2, a^1_3)$. Multiplying, we do rows times columns, but our rows and columns here are very small: the first entry is $\delta^2_1 a^1_1$, the next is $\delta^2_1 a^1_2$, then $\delta^2_1 a^1_3$, and the second row starts with $\delta^2_2 a^1_1$, and so on. You'll see we follow the exact same pattern as before: the errors are the same across the rows, and the activations are different across the columns. So when we multiply these two, we get exactly our matrix of partial derivatives. Let's write this out.
What we get is this vectorized version of the equation: the derivative of the cost with respect to the weights matrix of any layer $l$ equals the error vector of layer $l$ times the transpose of the activations of layer $l - 1$:

$$\frac{\partial C}{\partial W^l} = \delta^l \left(a^{l-1}\right)^T$$

This is some $n \times 1$ times some $1 \times m$, where $n$ is the number of nodes in layer $l$ and $m$ is the number of nodes in layer $l - 1$. The inner dimensions cancel and we get an $n \times m$ matrix: the first dimension is the number of nodes the weights go into, and the second is the number of nodes they come from, which matches what we expect the weights matrix to look like. So now we have a vectorized version of all the formulas, and this is the new formula we'll be working with for finding the matrix of derivatives of all weights of any layer with respect to the cost. This will be extremely powerful, and next we'll move on to the last part, where we actually explore the backpropagation algorithm and how we use these four equations to find the derivatives of our network and plug them into gradient descent.
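As a sanity check, here's a small NumPy sketch (with random stand-in values) confirming that the outer product reproduces the element-wise rule for any layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                        # n nodes in layer l, m nodes in layer l-1
delta_l = rng.normal(size=(n, 1))  # errors of layer l
a_prev = rng.normal(size=(m, 1))   # activations of layer l-1

# Vectorized equation: dC/dW^l = delta^l (a^{l-1})^T, shape (n, m)
dW_vec = delta_l @ a_prev.T

# Element-wise version for comparison
dW_loop = np.zeros((n, m))
for j in range(n):
    for k in range(m):
        dW_loop[j, k] = a_prev[k, 0] * delta_l[j, 0]

print(np.allclose(dW_vec, dW_loop))  # True
```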
All right, I'll see you there. So you might have noticed that we've looked at finding the derivatives of the cost with respect to the weights and the biases in two different ways. One was with our small network, where we had inputs $x_1$ down to $x_5$ feeding one neuron, compared the activation of that neuron with some real answer ($a^L - y$), fed that into our cost function, and found the derivatives using our knowledge of Jacobians and that sort of thing. Then we pivoted to talking about backpropagation and its four rules, and more specifically the two rules that give us the cost with respect to the weights and the cost with respect to the biases. So let's try to find some connections between these two ways of looking at the derivatives, which gives a little more insight into why backpropagation does the same thing we did in the small network, just more efficiently. Let's rewrite our two derivatives from the small network,
computing how the cost changes with respect to the weights and with respect to the bias. They were both piecewise functions. For the bias, the cost with respect to the bias was either zero (a scalar), or, factoring in the summation and the average over the $m$ training examples:

$$\frac{\partial C}{\partial b} = \begin{cases} 0 & \text{if } w^T x + b \le 0 \\[4pt] \dfrac{1}{m} \displaystyle\sum_{i=1}^{m} \left(w^T x_i + b - y_i\right) & \text{if } w^T x + b > 0 \end{cases}$$

Here $w$ is just the weights of that single layer, so we don't really need to worry about indexing, and there's nothing outside the sum. Then we have our second derivative, how the
cost changes with respect to our weights, which is marginally more complex, just a little bit. Now it's either a (transposed) zero vector, when $w^T x + b \le 0$, or, more interestingly, basically the same term as before multiplied by $x^T$:

$$\frac{\partial C}{\partial w^T} = \begin{cases} \vec{0}^{\,T} & \text{if } w^T x + b \le 0 \\[4pt] \dfrac{1}{m} \displaystyle\sum_{i=1}^{m} \left(w^T x_i + b - y_i\right) x_i^T & \text{if } w^T x + b > 0 \end{cases}$$

So those were our rules for the cost with respect to the weights and the cost with respect to the biases.
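Here's a minimal sketch of these piecewise rules in NumPy. It assumes the cost whose per-example derivative is exactly $(w^T x_i + b - y_i)$, as written above, and applies the ReLU condition per training example; the function name and shapes are illustrative:

```python
import numpy as np

def single_neuron_grads(w, b, X, y):
    """Piecewise gradients for the one-ReLU-neuron network above (a sketch).

    X: (m, d) inputs, y: (m,) targets, w: (d,) weights, b: scalar bias.
    """
    z = X @ w + b                      # pre-activations, shape (m,)
    err = np.where(z > 0, z - y, 0.0)  # (w^T x_i + b - y_i), zeroed if z <= 0
    db = err.mean()                    # dC/db: average of the errors
    dw = (err @ X) / len(y)            # dC/dw: average of err_i * x_i
    return dw, db
```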
Okay, so how can we find some similarities between these and the backpropagation rules? First, notice that, as we talked about before, $w^T x + b - y$ is basically an error term. It compares the output, which is $w^T x + b$ (we don't have to worry about the max activation function, because that's dealt with by the piecewise part), so that's basically our $a^L$, with the actual answer $y$. That gives us the error between our prediction and the actual value, our $\delta_i$. So let's rewrite these in terms of the error; the $w^T x + b - y$ terms are completely equivalent to it. For the biases we sum over
all the errors $\delta_i$, and for the weights we multiply the errors by $x^T$. To simplify further, let's just look at a single training example, $m = 1$, which gets rid of the sum and the average. And to simplify even further, let's only look at the case where $w^T x + b > 0$, because otherwise we're just looking at vectors of zeros. So now, in our very simplified world, the cost with respect to the bias is just the error, and the cost with respect to the weights
is the error multiplied by $x^T$:

$$\frac{\partial C}{\partial b} = \delta, \qquad \frac{\partial C}{\partial w^T} = \delta\, x^T$$

But where did we see these? Do they seem a little familiar? They should, because they're completely equivalent to backpropagation equations 3 and 4: here is our error, and here is our error times a transpose. The only difference is that we have $x^T$ instead of $(a^{l-1})^T$, and the reason is that we were dealing with a single layer. We had our single node and our five inputs, and those inputs are technically the $a^{l-1}$, the activations of the previous layer, except that we didn't really have a previous layer, so they were just our inputs. So you can view the rules of backpropagation as a generalized version of this very simple example.
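A tiny numerical check of this correspondence, for one example in the active branch (all values are made up):

```python
import numpy as np

x = np.array([0.3, 1.2, 0.7])   # inputs, playing the role of a^{l-1}
w = np.array([0.5, 0.1, 0.4])
b, y = 0.2, 1.0

z = w @ x + b                   # 0.75 here, so we're in the active branch
delta = z - y                   # the error term w^T x + b - y

db = delta                      # matches equation 3: dC/db = delta
dw = delta * x                  # matches equation 4: dC/dw = delta * x^T
print(db, dw)
```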
So what the rules of backpropagation basically do is take this really cumbersome calculation that we figured out and apply it to neural networks with multiple layers. The error in this very simple network is quite easy to understand: it's how the activation of this one neuron directly affects the cost. But when we're talking about larger networks, how some neuron back here affects the cost way over there, many layers forward, is a little harder to think about. It's really doing the same thing, though: instead of directly comparing with the cost, we're seeing how the activation of some node affects the layer in front of it, and the layer in front of that, and so on, all the way to the final cost. The two views are completely equivalent because they're doing the exact same thing; the single-node case just does it in a more obvious way, because we're comparing the neuron with a cost that sits right after it, so the effect of changing it is easy to see. In a larger network, changing the activation of some neuron obviously still affects the cost, just in a harder-to-trace way. I'll leave the super-detailed mathematics up to you, but hopefully this intuition makes sense: in both of our ways of looking for the derivatives, we're seeing how the error of a single node propagates through our calculations. So hopefully that gives some intuition as to why we looked at this from two different angles, first a very mathematically rigorous specific case at the beginning, and then the backpropagation rules right after, and now we see they're actually equivalent to each other, which is pretty interesting. After this, I'm going to get into the actual backpropagation algorithm.
See you there. So here we are. After explaining all these equations, understanding where they came from, and seeing how we can derive them all from exploring the chain rule, we can now implement these simplified formulas without thinking about the chain rule each time, because we've already proved how they work. Now we can combine these four equations into backpropagation, our grand algorithm for finding the derivative of the cost with respect to any weight and bias. We can then plug those into one long gradient vector of all the biases and all the weights and how they affect the cost, to implement gradient descent as we explored previously. So let's explain the steps. After deriving all these equations this will actually be quite easy, because we're just implementing them in order.
First, we go through the entire network forward: we feed our inputs in. Let's draw a neural network we can follow throughout this lecture, say a four-layer network, which technically makes it a deep neural network. So here's our network. What we do first is forward propagate through, and the important thing is that we calculate and store everything: the $z$ vectors that hold the $z$ values of each layer, and the activations. We're going to use those in backpropagation, so we store them all so we don't have to recompute them when going back through; if we look at the four equations, they all use activations and $z$ values that we compute during forward prop. So step one, and this is important: during forward propagation we compute and store the $a^l$ vectors and the $z^l$ vectors. This is obviously very little extra computation, because we already calculate these when doing normal forward propagation to push our inputs through the network; instead of just discarding them, we store the $z$'s and $a$'s of each layer in vectors. So that's our first step. This is technically still not backpropagation, but this is the information we need to do backpropagation.
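Here's a minimal sketch of this caching forward pass, assuming ReLU activations and a list of weight matrices and bias vectors (the names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Feed forward, storing every z and a for backpropagation.

    weights[l] and biases[l] map layer l to layer l + 1.
    """
    a = x
    zs, activations = [], [a]        # store a^0 (the input) as well
    for W, b in zip(weights, biases):
        z = W @ a + b                # z^l = W^l a^{l-1} + b^l
        a = relu(z)                  # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```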
Once we get to the end, we compute the cost; that's step two. After computing the cost, the backpropagation algorithm proper starts. Step three: use our first equation to calculate the error of the last layer, $\delta^L$. That means finding the derivative of the cost with respect to $a^L$ and, since we defined our errors as how the cost changes with respect to some $z$, passing that through the derivative of the activation function, as I explained previously:

$$\delta^L = \nabla_a C \odot \sigma'\!\left(z^L\right)$$

So we use equation 1 to calculate our first layer of errors going backwards, $\delta^L$.
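In code, this step might look like the following sketch, assuming a squared-error cost (so the cost's derivative with respect to $a^L$ is $a^L - y$) and ReLU activations:

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(float)

# Equation 1: error of the output layer.
def output_error(aL, y, zL):
    return (aL - y) * relu_prime(zL)   # delta^L = grad_a C (*) sigma'(z^L)
```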
Once we have this, we can use the second equation to keep going backwards through the network. After unlocking the last layer's errors (well, the first layer going backwards), we can plug them in and solve for the errors of the layer before it:

$$\delta^l = \left(\left(W^{l+1}\right)^T \delta^{l+1}\right) \odot \sigma'\!\left(z^l\right)$$

Looking at our network, the output node gives $a^L$, and behind it we have layers $L - 1$, $L - 2$, $L - 3$. The output is a single node, so $\delta^L$ is a vector containing one element. We can go backwards through our network and calculate the vector of errors in the previous layer by plugging this one error in and multiplying by the weights matrix that connects the two layers. Let's just see if those dimensions work. What we should be expecting is the
errors of that layer, a $3 \times 1$ vector, since it has three nodes. We don't have to concretely calculate them; let's just see if our dimensions work out. The error of the output layer is effectively a scalar ($1 \times 1$), and we multiply it by the transpose of the weights matrix. As we've talked about, the rows of a weights matrix correspond to where the weights are going into (remember our $j$ and $k$), so here it's one row, because the weights only go to one node $j$, and there are three nodes to come from, so the row is three long: $\begin{bmatrix} w_{11} & w_{12} & w_{13} \end{bmatrix}$. Taking the transpose turns the $1 \times 3$ into a $3 \times 1$, and a $3 \times 1$ times a scalar gives us a $3 \times 1$ back. Then there's just the element-wise multiplication by the derivatives of the $z$'s, and element-wise means the vector keeps its dimensions. So that's a good sanity check that the equation works: we get back a vector of errors of the right dimension for this layer.
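Here's that dimension check as a sketch in NumPy, with illustrative values:

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(float)

# Equation 2: pull the errors one layer backwards.
def layer_error(W_next, delta_next, z):
    return (W_next.T @ delta_next) * relu_prime(z)

# One output node, three nodes in the layer behind it.
W = np.array([[0.4, -0.2, 0.7]])       # (1, 3) weights into the single node
delta_out = np.array([0.35])           # (1,)  error of the output node
z = np.array([0.5, -0.1, 0.9])         # (3,)  pre-activations of the layer
print(layer_error(W, delta_out, z).shape)  # (3,)
```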
So that's what we do for each layer: we plug the errors in, use the weights (which we already have) and the $z$'s we calculated in the first step, and now we have the errors of layer $l - 1$; then we plug those in again, and again. Basically, using these two equations, we can calculate the error of every single layer, a vector of errors for every node in every layer of the entire network. Note that backpropagation isn't actually directly computing the weight and bias derivatives as we go backwards; we're just calculating the errors. You can think of it as taking one step, getting the errors here using equation 1, then walking back through the network, accumulating errors using equation 2, all the way back to the start. Once we're done, we have all the information we need to calculate, in retrospect (I'm trying to think of a better word than post-mortem), the derivatives of the cost with respect to the biases and the weights. So step four: use equation 2 to compute all of these error terms. After doing that,
you'll see how direct the derivative of the cost with respect to the biases is: for each layer, that vector is directly equal to the vector of errors, $\frac{\partial C}{\partial b^l} = \delta^l$. So by calculating these vectors of errors we automatically know the derivative of the cost with respect to the bias of every node in every layer. It's barely even an equation, really; we're just setting one thing equal to another. So step five: once we've found all our errors, use equation 3 to find the bias derivatives (or rather, the derivatives of the cost with respect to the biases, but you get the point).
Our final step, six: since we already computed all the $a$ terms during forward propagation, we just plug them in, transpose them, and multiply by our errors to get those matrices of derivatives we were talking about, $\frac{\partial C}{\partial W^l} = \delta^l \left(a^{l-1}\right)^T$. These are all going to be matrices the same size as the weight matrices, as we explored when vectorizing the weights formula. So: calculate the weight derivatives using equation 4. Using these six steps of backpropagation, we can calculate our gradient of the cost with respect to all weights and biases, one super long gradient, and we can plug it right into gradient descent like we learned about before.
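Putting all six steps together, here's a sketch of the whole algorithm in NumPy, again assuming ReLU activations and a squared-error cost (the function and variable names are illustrative):

```python
import numpy as np

def relu(z): return np.maximum(0.0, z)
def relu_prime(z): return (z > 0).astype(float)

def backprop(x, y, weights, biases):
    """All six steps in order (a sketch, not a definitive implementation)."""
    # Steps 1-2: forward pass, storing every z and a along the way.
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = relu(z)
        zs.append(z)
        activations.append(a)
    # Step 3: equation 1 at the output layer.
    delta = (activations[-1] - y) * relu_prime(zs[-1])
    grads_b = [delta]                              # equation 3
    grads_W = [np.outer(delta, activations[-2])]   # equation 4
    # Step 4: equation 2, walking backwards and accumulating errors,
    # applying equations 3 and 4 at each layer as we go.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * relu_prime(zs[-l])
        grads_b.insert(0, delta)
        grads_W.insert(0, np.outer(delta, activations[-l - 1]))
    return grads_W, grads_b
```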
So just to close this entire lecture, let's finally, for completion, plug all of these in. Of course, these are going to be disjointed: for each layer we have one vector of the cost's derivatives with respect to the biases, and likewise a matrix of derivatives for each weight layer. To combine them into one big gradient, like what gradient descent wants, we're going to "unroll" the matrices: basically just keep all the elements but store them in a vector, so a matrix with entries a, b, c, d, e, f becomes the vector (a, b, c, d, e, f). We already have all the biases in vector form, so we just take these unrolled vectors, concatenate them with all the bias vectors, and end up with one really long gradient. So now we can plug this into our
gradient descent formula. Let's use the symbol $\theta$ to signify all our parameters, weights and biases. We can update them all at once: $\theta$ is a vector the same size as our gradient, because it includes all weights and biases in one long vector, and we update it with the negative of some learning rate $\alpha$ times our gradient of the cost with respect to all biases and weights:

$$\theta \leftarrow \theta - \alpha \nabla_\theta C$$

This is just an element-wise subtraction. We already talked about gradient descent, so I won't go over it too much, but this is the update we plug everything into.
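A sketch of the unrolling and one update step, with hypothetical shapes for a tiny two-layer network:

```python
import numpy as np

def unroll(mats, vecs):
    """Flatten the weight matrices and concatenate with the bias vectors
    into one long vector (a sketch)."""
    return np.concatenate([m.ravel() for m in mats] + [v.ravel() for v in vecs])

# Hypothetical parameters and gradients for a tiny two-layer network:
weights = [np.zeros((3, 2)), np.zeros((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
grads_W = [np.ones((3, 2)), np.ones((1, 3))]
grads_b = [np.ones(3), np.ones(1)]

theta = unroll(weights, biases)   # all parameters in one long vector
grad = unroll(grads_W, grads_b)   # the matching gradient vector
alpha = 0.01                      # some learning rate
theta = theta - alpha * grad      # element-wise gradient descent update
```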
Gradient descent is an iterative algorithm. We do a forward pass through our network for the first time, then a backward pass, and then we update all those weights with our gradient using the gradient descent formula. Then we forward pass again through the network with the new weights, and hopefully our cost is slightly lower now; then we backpropagate through it again and see how we can nudge all our weights and biases by a little bit again; then forward, back, update, forward, back, update, for however long we have to, which can go up to thousands or tens of thousands of iterations. Hopefully our cost converges onto some local or global minimum: we should see the cost lowering each iteration, a steady kind of drop in the cost-versus-iterations curve. The cost might go up a little bit along the way, but at some point it's going to settle, and that's where we converge.
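Putting the whole loop together, here's a sketch reusing the hypothetical backprop() helper from the step-six sketch above; updating layer by layer like this is equivalent to the single unrolled update, just without the concatenation:

```python
# A sketch of the full training loop (x, y, weights, biases as above).
alpha = 0.01
for step in range(10_000):    # often thousands to tens of thousands of passes
    grads_W, grads_b = backprop(x, y, weights, biases)  # forward + backward
    for l in range(len(weights)):
        weights[l] -= alpha * grads_W[l]   # nudge every weight a little
        biases[l] -= alpha * grads_b[l]    # and every bias
    # hopefully the cost drops steadily and settles near a minimum
```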
So that's it, that's all we need to know. I hope you enjoyed it, I hope you learned something and got something out of this lecture; I know I did. Thanks for sticking around. Thinking about looking forward, there are a few different things you can dive into after understanding the mathematics of backpropagation. On the theoretical side, you can learn how backpropagation operates in different types of neural networks, because there's such an array of them; this is just the simplest and most basic of the many, many neural networks that exist right now. You can look at how backpropagation works in convolutional neural networks, recurrent neural networks, or a whole plethora of other algorithms. And then you can also go into the practical side. We've only sketched a little code in this lecture, but certainly this knowledge can be applied in real settings: if you're coding in Python, you might be using PyTorch, TensorFlow, or Keras,
probably PyTorch or TensorFlow for neural networks, and you can apply it there. One thing I might recommend is trying out a very simple neural network task, like the classic classifying-digits task on the MNIST database, which you've probably heard of. Try something like that with an artificial neural network like the one we learned about, without any external libraries except maybe NumPy or something like that, and see how you can create a neural network from the ground up using these rules of backpropagation and forward propagation, before stepping into things like PyTorch or TensorFlow, where built-in libraries help us with all that. So that would be my recommendation for looking forward, but of course it's up to you now. I thank you for watching this, and I hope you got something out of it.
So see you soon, maybe. Bye. [Music]