The spelled-out intro to neural networks and backpropagation: building micrograd
By Andrej Karpathy
Summary
## Key takeaways - **Backpropagation is the core of neural networks**: Backpropagation is the algorithm that allows efficient evaluation of gradients for loss functions with respect to neural network weights, enabling iterative tuning to minimize loss and improve accuracy. It's the mathematical core of modern deep learning libraries. [00:57], [01:16] - **Micrograd: building blocks for neural nets**: Micrograd allows building mathematical expressions by wrapping numbers in 'Value' objects. These objects track their data, their children nodes, and the operation that produced them, forming an expression graph. [00:26], [01:48] - **Derivatives measure function sensitivity**: A derivative quantifies how a function's output changes in response to a small change in its input. It represents the slope or sensitivity at a specific point. [10:34], [11:13] - **Chain rule chains local derivatives**: The chain rule in calculus allows calculating the derivative of a composite function by multiplying the derivatives of the individual functions. This is how gradients are propagated backward through complex expression graphs. [41:41], [43:10] - **Neural networks are complex expressions**: Neural networks are essentially complex mathematical expressions where layers of neurons process data sequentially. Each neuron performs a weighted sum of inputs, adds a bias, and passes the result through an activation function. [04:49], [53:31] - **Zeroing gradients is crucial for training**: During neural network training, gradients accumulate if not reset. It's essential to zero out gradients before each backward pass to ensure accurate updates based on the current loss. [02:10:59], [02:11:31]
Topics Covered
- Everything besides backpropagation is just for efficiency.
- Backpropagation is a recursive application of the chain rule.
- The abstraction level of your functions is arbitrary.
- This is the most common bug in neural networks.
- The four-step loop that trains all neural networks.
Full Transcript
hello my name is andre
and i've been training deep neural
networks for a bit more than a decade
and in this lecture i'd like to show you
what neural network training looks like
under the hood so in particular we are
going to start with a blank jupiter
notebook and by the end of this lecture
we will define and train in neural net
and you'll get to see everything that
goes on under the hood and exactly
sort of how that works on an intuitive
level
now specifically what i would like to do
is i would like to take you through
building of micrograd now micrograd is
this library that i released on github
about two years ago but at the time i
only uploaded the source code and you'd
have to go in by yourself and really
figure out how it works
so in this lecture i will take you
through it step by step and kind of
comment on all the pieces of it so what
is micrograd and why is it interesting
good
um
micrograd is basically an autograd
engine autograd is short for automatic
gradient and really what it does is it
implements backpropagation now
backpropagation is this algorithm that
allows you to efficiently evaluate the
gradient of
some kind of a loss function with
respect to the weights of a neural
network and what that allows us to do
then is we can iteratively tune the
weights of that neural network to
minimize the loss function and therefore
improve the accuracy of the network so
back propagation would be at the
mathematical core of any modern deep
neural network library like say pytorch
or jaxx
so the functionality of microgrant is i
think best illustrated by an example so
if we just scroll down here
you'll see that micrograph basically
allows you to build out mathematical
expressions
and um here what we are doing is we have
an expression that we're building out
where you have two inputs a and b
and you'll see that a and b are negative
four and two but we are wrapping those
values into this value object that we
are going to build out as part of
micrograd
so this value object will wrap the
numbers themselves
and then we are going to build out a
mathematical expression here where a and
b are transformed into c d and
eventually e f and g
and i'm showing some of the functions
some of the functionality of micrograph
and the operations that it supports so
you can add two value objects you can
multiply them you can raise them to a
constant power you can offset by one
negate squash at zero
square divide by constant divide by it
etc
and so we're building out an expression
graph with with these two inputs a and b
and we're creating an output value of g
and micrograd will in the background
build out this entire mathematical
expression so it will for example know
that c is also a value
c was a result of an addition operation
and the
child nodes of c are a and b because the
and will maintain pointers to a and b
value objects so we'll basically know
exactly how all of this is laid out
and then not only can we do what we call
the forward pass where we actually look
at the value of g of course that's
pretty straightforward we will access
that using the dot data attribute and so
the output of the forward pass the value
of g is 24.7 it turns out but the big
deal is that we can also take this g
value object and we can call that
backward
and this will basically uh initialize
back propagation at the node g
and what backpropagation is going to do
is it's going to start at g and it's
going to go backwards through that
expression graph and it's going to
recursively apply the chain rule from
calculus
and what that allows us to do then is
we're going to evaluate basically the
derivative of g with respect to all the
internal nodes
like e d and c but also with respect to
the inputs a and b
and then we can actually query this
derivative of g with respect to a for
example that's a dot grad in this case
it happens to be 138 and the derivative
of g with respect to b
which also happens to be here 645
and this derivative we'll see soon is
very important information because it's
telling us how a and b are affecting g
through this mathematical expression so
in particular
a dot grad is 138 so if we slightly
nudge a and make it slightly larger
138 is telling us that g will grow and
the slope of that growth is going to be
138
and the slope of growth of b is going to
be 645. so that's going to tell us about
how g will respond if a and b get
tweaked a tiny amount in a positive
direction
okay
now you might be confused about what
this expression is that we built out
here and this expression by the way is
completely meaningless i just made it up
i'm just flexing about the kinds of
operations that are supported by
micrograd
what we actually really care about are
neural networks but it turns out that
neural networks are just mathematical
expressions just like this one but
actually slightly bit less crazy even
neural networks are just a mathematical
expression they take the input data as
an input and they take the weights of a
neural network as an input and it's a
mathematical expression and the output
are your predictions of your neural net
or the loss function we'll see this in a
bit but basically neural networks just
happen to be a certain class of
mathematical expressions
but back propagation is actually
significantly more general it doesn't
actually care about neural networks at
all it only tells us about arbitrary
mathematical expressions and then we
happen to use that machinery for
training of neural networks now one more
note i would like to make at this stage
is that as you see here micrograd is a
scalar valued auto grant engine so it's
working on the you know level of
individual scalars like negative four
and two and we're taking neural nets and
we're breaking them down all the way to
these atoms of individual scalars and
all the little pluses and times and it's
just excessive and so obviously you
would never be doing any of this in
production it's really just put down for
pedagogical reasons because it allows us
to not have to deal with these
n-dimensional tensors that you would use
in modern deep neural network library so
this is really done so that you
understand and refactor out back
propagation and chain rule and
understanding of neurologic training
and then if you actually want to train
bigger networks you have to be using
these tensors but none of the math
changes this is done purely for
efficiency we are basically taking scale
value
all the scale values we're packaging
them up into tensors which are just
arrays of these scalars and then because
we have these large arrays we're making
operations on those large arrays that
allows us to take advantage of the
parallelism in a computer and all those
operations can be done in parallel and
then the whole thing runs faster but
really none of the math changes and
that's done purely for efficiency so i
don't think that it's pedagogically
useful to be dealing with tensors from
scratch uh and i think and that's why i
fundamentally wrote micrograd because
you can understand how things work uh at
the fundamental level and then you can
speed it up later okay so here's the fun
part my claim is that micrograd is what
you need to train your networks and
everything else is just efficiency so
you'd think that micrograd would be a
very complex piece of code and that
turns out to not be the case
so if we just go to micrograd
and you'll see that there's only two
files here in micrograd this is the
actual engine it doesn't know anything
about neural nuts and this is the entire
neural nets library
on top of micrograd so engine and nn.pi
so the actual backpropagation autograd
engine
that gives you the power of neural
networks is literally
100 lines of code of like very simple
python
which we'll understand by the end of
this lecture
and then nn.pi
this neural network library built on top
of the autograd engine
um is like a joke it's like
we have to define what is a neuron and
then we have to define what is the layer
of neurons and then we define what is a
multi-layer perceptron which is just a
sequence of layers of neurons and so
it's just a total joke
so basically
there's a lot of power that comes from
only 150 lines of code
and that's all you need to understand to
understand neural network training and
everything else is just efficiency and
of course there's a lot to efficiency
but fundamentally that's all that's
happening okay so now let's dive right
in and implement micrograph step by step
the first thing i'd like to do is i'd
like to make sure that you have a very
good understanding intuitively of what a
derivative is and exactly what
information it gives you so let's start
with some basic imports that i copy
paste in every jupiter notebook always
and let's define a function a scalar
valued function
f of x
as follows
so i just make this up randomly i just
want to scale a valid function that
takes a single scalar x and returns a
single scalar y
and we can call this function of course
so we can pass in say 3.0 and get 20
back
now we can also plot this function to
get a sense of its shape you can tell
from the mathematical expression that
this is probably a parabola it's a
quadratic
and so if we just uh create a set of um
um
scale values that we can feed in using
for example a range from negative five
to five in steps of 0.25
so this is so axis is just from negative
5 to 5 not including 5 in steps of 0.25
and we can actually call this function
on this numpy array as well so we get a
set of y's if we call f on axis
and these y's are basically
also applying a function on every one of
these elements independently
and we can plot this using matplotlib so
plt.plot x's and y's and we get a nice
parabola so previously here we fed in
3.0 somewhere here and we received 20
back which is here the y coordinate so
now i'd like to think through
what is the derivative
of this function at any single input
point x
right so what is the derivative at
different points x of this function now
if you remember back to your calculus
class you've probably derived
derivatives so we take this mathematical
expression 3x squared minus 4x plus 5
and you would write out on a piece of
paper and you would you know apply the
product rule and all the other rules and
derive the mathematical expression of
the great derivative of the original
function and then you could plug in
different texts and see what the
derivative is
we're not going to actually do that
because no one in neural networks
actually writes out the expression for
the neural net it would be a massive
expression um it would be you know
thousands tens of thousands of terms no
one actually derives the derivative of
course and so we're not going to take
this kind of like a symbolic approach
instead what i'd like to do is i'd like
to look at the definition of derivative
and just make sure that we really
understand what derivative is measuring
what it's telling you about the function
and so if we just look up derivative
we see that
okay so this is not a very good
definition of derivative this is a
definition of what it means to be
differentiable
but if you remember from your calculus
it is the limit as h goes to zero of f
of x plus h minus f of x over h so
basically what it's saying is if you
slightly bump up you're at some point x
that you're interested in or a and if
you slightly bump up
you know you slightly increase it by
small number h
how does the function respond with what
sensitivity does it respond what is the
slope at that point does the function go
up or does it go down and by how much
and that's the slope of that function
the
the slope of that response at that point
and so we can basically evaluate
the derivative here numerically by
taking a very small h of course the
definition would ask us to take h to
zero we're just going to pick a very
small h 0.001
and let's say we're interested in point
3.0 so we can look at f of x of course
as 20
and now f of x plus h
so if we slightly nudge x in a positive
direction how is the function going to
respond
and just looking at this do you expect
do you expect f of x plus h to be
slightly greater than 20 or do you
expect to be slightly lower than 20
and since this 3 is here and this is 20
if we slightly go positively the
function will respond positively so
you'd expect this to be slightly greater
than 20. and now by how much it's
telling you the
sort of the
the strength of that slope right the the
size of the slope so f of x plus h minus
f of x this is how much the function
responded
in the positive direction and we have to
normalize by the
run so we have the rise over run to get
the slope so this of course is just a
numerical approximation of the slope
because we have to make age very very
small to converge to the exact amount
now if i'm doing too many zeros
at some point
i'm gonna get an incorrect answer
because we're using floating point
arithmetic and the representations of
all these numbers in computer memory is
finite and at some point we get into
trouble
so we can converse towards the right
answer with this approach
but basically um at 3 the slope is 14.
and you can see that by taking 3x
squared minus 4x plus 5 and
differentiating it in our head
so 3x squared would be
6 x minus 4
and then we plug in x equals 3 so that's
18 minus 4 is 14. so this is correct
so that's
at 3. now how about the slope at say
negative 3
would you expect would you expect for
the slope
now telling the exact value is really
hard but what is the sign of that slope
so at negative three
if we slightly go in the positive
direction at x the function would
actually go down and so that tells you
that the slope would be negative so
we'll get a slight number below
below 20. and so if we take the slope we
expect something negative
negative 22. okay
and at some point here of course the
slope would be zero now for this
specific function i looked it up
previously and it's at point two over
three
so at roughly two over three
uh that's somewhere here
um
this derivative be zero
so basically at that precise point
yeah
at that precise point if we nudge in a
positive direction the function doesn't
respond this stays the same almost and
so that's why the slope is zero okay now
let's look at a bit more complex case
so we're going to start you know
complexifying a bit so now we have a
function
here
with output variable d
that is a function of three scalar
inputs a b and c
so a b and c are some specific values
three inputs into our expression graph
and a single output d
and so if we just print d we get four
and now what i have to do is i'd like to
again look at the derivatives of d with
respect to a b and c
and uh think through uh again just the
intuition of what this derivative is
telling us
so in order to evaluate this derivative
we're going to get a bit hacky here
we're going to again have a very small
value of h
and then we're going to fix the inputs
at some
values that we're interested in
so these are the this is the point abc
at which we're going to be evaluating
the the
derivative of d with respect to all a b
and c at that point
so there are the inputs and now we have
d1 is that expression
and then we're going to for example look
at the derivative of d with respect to a
so we'll take a and we'll bump it by h
and then we'll get d2 to be the exact
same function
and now we're going to print um
you know f1
d1 is d1
d2 is d2
and print slope
so the derivative or slope
here will be um
of course
d2
minus d1 divide h
so d2 minus d1 is how much the function
increased
uh when we bumped
the uh
the specific input that we're interested
in by a tiny amount
and
this is then normalized by h
to get the slope
so
um
yeah
so this so if i just run this we're
going to print
d1
which we know is four
now d2 will be bumped a will be bumped
by h
so let's just think through
a little bit uh what d2 will be uh
printed out here
in particular
d1 will be four
will d2 be a number slightly greater
than four or slightly lower than four
and that's going to tell us the sl the
the sign of the derivative
so
we're bumping a by h
b as minus three c is ten
so you can just intuitively think
through this derivative and what it's
doing a will be slightly more positive
and but b is a negative number
so if a is slightly more positive
because b is negative three
we're actually going to be adding less
to d
so you'd actually expect that the value
of the function will go down
so let's just see this
yeah and so we went from 4
to 3.9996
and that tells you that the slope will
be negative
and then
uh will be a negative number
because we went down
and then
the exact number of slope will be
exact amount of slope is negative 3.
and you can also convince yourself that
negative 3 is the right answer
mathematically and analytically because
if you have a times b plus c and you are
you know you have calculus then
differentiating a times b plus c with
respect to a gives you just b
and indeed the value of b is negative 3
which is the derivative that we have so
you can tell that that's correct
so now if we do this with b
so if we bump b by a little bit in a
positive direction we'd get different
slopes so what is the influence of b on
the output d
so if we bump b by a tiny amount in a
positive direction then because a is
positive
we'll be adding more to d
right
so um and now what is the what is the
sensitivity what is the slope of that
addition
and it might not surprise you that this
should be
2
and y is a 2 because d of d
by db differentiating with respect to b
would be would give us a
and the value of a is two so that's also
working well
and then if c gets bumped a tiny amount
in h
by h
then of course a times b is unaffected
and now c becomes slightly bit higher
what does that do to the function it
makes it slightly bit higher because
we're simply adding c
and it makes it slightly bit higher by
the exact same amount that we added to c
and so that tells you that the slope is
one
that will be the
the rate at which
d will increase as we scale
c
okay so we now have some intuitive sense
of what this derivative is telling you
about the function and we'd like to move
to neural networks now as i mentioned
neural networks will be pretty massive
expressions mathematical expressions so
we need some data structures that
maintain these expressions and that's
what we're going to start to build out
now
so we're going to
build out this value object that i
showed you in the readme page of
micrograd
so let me copy paste a skeleton of the
first very simple value object
so class value takes a single
scalar value that it wraps and keeps
track of
and that's it so
we can for example do value of 2.0 and
then we can
get we can look at its content and
python will internally
use the wrapper function
to uh return
uh this string oops
like that
so this is a value object with data
equals two that we're creating here
now we'd like to do is like we'd like to
be able to
have not just like two values
but we'd like to do a bluffy right we'd
like to add them
so currently you would get an error
because python doesn't know how to add
two value objects so we have to tell it
so here's
addition
so you have to basically use these
special double underscore methods in
python to define these operators for
these objects so if we call um
the uh if we use this plus operator
python will internally call a dot add of
b
that's what will happen internally and
so b will be the other and
self will be a
and so we see that what we're going to
return is a new value object and it's
just it's going to be wrapping
the plus of
their data
but remember now because data is the
actual like numbered python number so
this operator here is just the typical
floating point plus addition now it's
not an addition of value objects
and will return a new value so now a
plus b should work and it should print
value of
negative one
because that's two plus minus three
there we go
okay let's now implement multiply
just so we can recreate this expression
here
so multiply i think it won't surprise
you will be fairly similar
so instead of add we're going to be
using mul
and then here of course we want to do
times
and so now we can create a c value
object which will be 10.0 and now we
should be able to do a times b well
let's just do a times b first
um
[Music]
that's value of negative six now
and by the way i skipped over this a
little bit suppose that i didn't have
the wrapper function here
then it's just that you'll get some kind
of an ugly expression so what wrapper is
doing is it's providing us a way to
print out like a nicer looking
expression in python
uh so we don't just have something
cryptic we actually are you know it's
value of
negative six so this gives us a times
and then this we should now be able to
add c to it because we've defined and
told the python how to do mul and add
and so this will call this will
basically be equivalent to a dot
small
of b
and then this new value object will be
dot add
of c
and so let's see if that worked
yep so that worked well that gave us
four which is what we expect from before
and i believe we can just call them
manually as well there we go so
yeah
okay so now what we are missing is the
connective tissue of this expression as
i mentioned we want to keep these
expression graphs so we need to know and
keep pointers about what values produce
what other values
so here for example we are going to
introduce a new variable which we'll
call children and by default it will be
an empty tuple
and then we're actually going to keep a
slightly different variable in the class
which we'll call underscore prev which
will be the set of children
this is how i done i did it in the
original micrograd looking at my code
here i can't remember exactly the reason
i believe it was efficiency but this
underscore children will be a tuple for
convenience but then when we actually
maintain it in the class it will be just
this set yeah i believe for efficiency
um
so now
when we are creating a value like this
with a constructor children will be
empty and prep will be the empty set but
when we're creating a value through
addition or multiplication we're going
to feed in the children of this value
which in this case is self and other
so those are the children
here
so now we can do d dot prev
and we'll see that the children of the
we now know are this value of negative 6
and value of 10 and this of course is
the value resulting from a times b and
the c value which is 10.
now the last piece of information we
don't know so we know that the children
of every single value but we don't know
what operation created this value
so we need one more element here let's
call it underscore pop
and by default this is the empty set for
leaves
and then we'll just maintain it here
and now the operation will be just a
simple string and in the case of
addition it's plus in the case of
multiplication is times
so now we
not just have d dot pref we also have a
d dot up
and we know that d was produced by an
addition of those two values and so now
we have the full
mathematical expression uh and we're
building out this data structure and we
know exactly how each value came to be
by word expression and from what other
values
now because these expressions are about
to get quite a bit larger we'd like a
way to nicely visualize these
expressions that we're building out so
for that i'm going to copy paste a bunch
of slightly scary code that's going to
visualize this these expression graphs
for us
so here's the code and i'll explain it
in a bit but first let me just show you
what this code does
basically what it does is it creates a
new function drawdot that we can call on
some root node
and then it's going to visualize it so
if we call drawdot on d
which is this final value here that is a
times b plus c
it creates something like this so this
is d
and you see that this is a times b
creating an integrated value plus c
gives us this output node d
so that's dried out of d
and i'm not going to go through this in
complete detail you can take a look at
graphless and its api uh graphis is a
open source graph visualization software
and what we're doing here is we're
building out this graph and graphis
api and
you can basically see that trace is this
helper function that enumerates all of
the nodes and edges in the graph
so that just builds a set of all the
nodes and edges and then we iterate for
all the nodes and we create special node
objects
for them in
using dot node
and then we also create edges using dot
dot edge
and the only thing that's like slightly
tricky here is you'll notice that i
basically add these fake nodes which are
these operation nodes so for example
this node here is just like a plus node
and
i create these
special op nodes here
and i connect them accordingly so these
nodes of course are not actual
nodes in the original graph
they're not actually a value object the
only value objects here are the things
in squares those are actual value
objects or representations thereof and
these op nodes are just created in this
drawdot routine so that it looks nice
let's also add labels to these graphs
just so we know what variables are where
so let's create a special underscore
label
um
or let's just do label
equals empty by default and save it in
each node
and then here we're going to do label as
a
label is the
label a c
and then
let's create a special um
e equals a times b
and e dot label will be e
it's kind of naughty
and e will be e plus c
and a d dot label will be
d
okay so nothing really changes i just
added this new e function
a new e variable
and then here when we are
printing this
i'm going to print the label here so
this will be a percent s
bar
and this will be end.label
and so now
we have the label on the left here so it
says a b creating e and then e plus c
creates d
just like we have it here
and finally let's make this expression
just one layer deeper
so d will not be the final output node
instead after d we are going to create a
new value object
called f we're going to start running
out of variables soon f will be negative
2.0
and its label will of course just be f
and then l capital l will be the output
of our graph
and l will be p times f
okay
so l will be negative eight is the
output
so
now we don't just draw a d we draw l
okay
and somehow the label of
l was undefined oops all that label has
to be explicitly sort of given to it
there we go so l is the output
so let's quickly recap what we've done
so far
we are able to build out mathematical
expressions using only plus and times so
far
they are scalar valued along the way
and we can do this forward pass
and build out a mathematical expression
so we have multiple inputs here a b c
and f
going into a mathematical expression
that produces a single output l
and this here is visualizing the forward
pass so the output of the forward pass
is negative eight that's the value
now what we'd like to do next is we'd
like to run back propagation
and in back propagation we are going to
start here at the end and we're going to
reverse
and calculate the gradient along along
all these intermediate values
and really what we're computing for
every single value here
um we're going to compute the derivative
of that node with respect to l
so
the derivative of l with respect to l is
just uh one
and then we're going to derive what is
the derivative of l with respect to f
with respect to d with respect to c with
respect to e
with respect to b and with respect to a
and in the neural network setting you'd
be very interested in the derivative of
basically this loss function l
with respect to the weights of a neural
network
and here of course we have just these
variables a b c and f
but some of these will eventually
represent the weights of a neural net
and so we'll need to know how those
weights are impacting
the loss function so we'll be interested
basically in the derivative of the
output with respect to some of its leaf
nodes and those leaf nodes will be the
weights of the neural net
and the other leaf nodes of course will
be the data itself but usually we will
not want or use the derivative of the
loss function with respect to data
because the data is fixed but the
weights will be iterated on
using the gradient information so next
we are going to create a variable inside
the value class that maintains the
derivative of l with respect to that
value
and we will call this variable grad
so there's a data and there's a
self.grad
and initially it will be zero and
remember that zero is basically means no
effect so at initialization we're
assuming that every value does not
impact does not affect the out the
output
right because if the gradient is zero
that means that changing this variable
is not changing the loss function
so by default we assume that the
gradient is zero
and then
now that we have grad and it's 0.0
we are going to be able to visualize it
here after data so here grad is 0.4 f
and this will be in that graph
and now we are going to be showing both
the data and the grad
initialized at zero
and we are just about getting ready to
calculate the back propagation
and of course this grad again as i
mentioned is representing
the derivative of the output in this
case l with respect to this value so
with respect to so this is the
derivative of l with respect to f with
respect to d and so on so let's now fill
in those gradients and actually do back
propagation manually so let's start
filling in these gradients and start all
the way at the end as i mentioned here
first we are interested to fill in this
gradient here so what is the derivative
of l with respect to l
in other words if i change l by a tiny
amount of h
how much does
l change
it changes by h so it's proportional and
therefore derivative will be one
we can of course measure these or
estimate these numerical gradients
numerically just like we've seen before
so if i take this expression
and i create a def lol function here
and put this here now the reason i'm
creating a gating function hello here is
because i don't want to pollute or mess
up the global scope here this is just
kind of like a little staging area and
as you know in python all of these will
be local variables to this function so
i'm not changing any of the global scope
here
so here l1 will be l
and then copy pasting this expression
we're going to add a small amount h
in for example a
right and this would be measuring the
derivative of l with respect to a
so here this will be l2
and then we want to print this
derivative so print
l2 minus l1 which is how much l changed
and then normalize it by h so this is
the rise over run
and we have to be careful because l is a
value node so we actually want its data
um
so that these are floats dividing by h
and this should print the derivative of
l with respect to a because a is the one
that we bumped a little bit by h
so what is the
derivative of l with respect to a
it's six
okay and obviously
if we change l by h
then that would be
here effectively
this looks really awkward but changing l
by h
you see the derivative here is 1. um
that's kind of like the base case of
what we are doing here
so basically we cannot come up here and
we can manually set l.grad to one this
is our manual back propagation
l dot grad is one and let's redraw
and we'll see that we filled in grad as
1 for l
we're now going to continue the back
propagation so let's here look at the
derivatives of l with respect to d and f
let's do a d first
so what we are interested in if i create
a markdown on here is we'd like to know
basically we have that l is d times f
and we'd like to know what is uh d
l by d d
what is that
and if you know your calculus uh l is d
times f so what is d l by d d it would
be f
and if you don't believe me we can also
just derive it because the proof would
be fairly straightforward uh we go to
the
definition of the derivative which is f
of x plus h minus f of x divide h
as a limit limit of h goes to zero of
this kind of expression so when we have
l is d times f
then increasing d by h
would give us the output of b plus h
times f
that's basically f of x plus h right
minus d times f
and then divide h and symbolically
expanding out here we would have
basically d times f plus h times f minus
t times f divide h
and then you see how the df minus df
cancels so you're left with h times f
divide h
which is f
so in the limit as h goes to zero of
you know
derivative
definition we just get f in the case of
d times f
so
symmetrically
dl by d
f will just be d
so what we have is that f dot grad
we see now is just the value of d
which is 4.
and we see that
d dot grad
is just uh the value of f
and so the value of f is negative two
so we'll set those manually
let me erase this markdown node and then
let's redraw what we have
okay
and let's just make sure that these were
correct so we seem to think that dl by
dd is negative two so let's double check
um let me erase this plus h from before
and now we want the derivative with
respect to f
so let's just come here when i create f
and let's do a plus h here and this
should print the derivative of l with
respect to f so we expect to see four
yeah and this is four up to floating
point
funkiness
and then dl by dd
should be f which is negative two
grad is negative two
so if we again come here and we change d
d dot data plus equals h right here
so we expect so we've added a little h
and then we see how l changed and we
expect to print
uh negative two
there we go
so we've numerically verified what we're
doing here is what kind of like an
inline gradient check gradient check is
when we
are deriving this like back propagation
and getting the derivative with respect
to all the intermediate results and then
numerical gradient is just you know
estimating it using small step size
now we're getting to the crux of
backpropagation so this will be the most
important node to understand because if
you understand the gradient for this
node you understand all of back
propagation and all of training of
neural nets basically
so we need to derive dl by bc
in other words the derivative of l with
respect to c
because we've computed all these other
gradients already
now we're coming here and we're
continuing the back propagation manually
so we want dl by dc and then we'll also
derive dl by de
now here's the problem
how do we derive dl
by dc
we actually know the derivative l with
respect to d so we know how l assessed
it to d
but how is l sensitive to c so if we
wiggle c how does that impact l
through d
so we know dl by dc
and we also here know how c impacts d
and so just very intuitively if you know
the impact that c is having on d and the
impact that d is having on l
then you should be able to somehow put
that information together to figure out
how c impacts l
and indeed this is what we can actually
do so in particular we know just
concentrating on d first let's look at
how what is the derivative basically of
d with respect to c so in other words
what is dd by dc
so here we know that d is c times c plus
e
that's what we know and now we're
interested in dd by dc
if you just know your calculus again and
you remember that differentiating c plus
e with respect to c you know that that
gives you
1.0
and we can also go back to the basics
and derive this because again we can go
to our f of x plus h minus f of x
divide by h
that's the definition of a derivative as
h goes to zero
and so here
focusing on c and its effect on d
we can basically do the f of x plus h
will be
c is incremented by h plus e
that's the first evaluation of our
function minus
c plus e
and then divide h
and so what is this
uh just expanding this out this will be
c plus h plus e minus c minus e
divide h and then you see here how c
minus c cancels e minus e cancels we're
left with h over h which is 1.0
and so
by symmetry also d d by d
e
will be 1.0 as well
so basically the derivative of a sum
expression is very simple and and this
is the local derivative so i call this
the local derivative because we have the
final output value all the way at the
end of this graph and we're now like a
small node here
and this is a little plus node
and it the little plus node doesn't know
anything about the rest of the graph
that it's embedded in all it knows is
that it did a plus it took a c and an e
added them and created d
and this plus note also knows the local
influence of c on d or rather rather the
derivative of d with respect to c and it
also
knows the derivative of d with respect
to e but that's not what we want that's
just a local derivative what we actually
want is d l by d c and l could l is here
just one step away but in a general case
this little plus note is could be
embedded in like a massive graph
so
again we know how l impacts d and now we
know how c and e impact d how do we put
that information together to write dl by
dc and the answer of course is the chain
rule in calculus
and so um
i pulled up a chain rule here from
kapedia
and
i'm going to go through this very
briefly so chain rule
wikipedia sometimes can be very
confusing and calculus can
can be very confusing like this is the
way i
learned
chain rule and it was very confusing
like what is happening it's just
complicated so i like this expression
much better
if a variable z depends on a variable y
which itself depends on the variable x
then z depends on x as well obviously
through the intermediate variable y
in this case the chain rule is expressed
as
if you want dz by dx
then you take the dz by dy and you
multiply it by d y by dx
so the chain rule fundamentally is
telling you
how
we chain these
uh derivatives together
correctly so to differentiate through a
function composition
we have to apply a multiplication
of
those derivatives
so that's really what chain rule is
telling us
and there's a nice little intuitive
explanation here which i also think is
kind of cute the chain rule says that
knowing the instantaneous rate of change
of z with respect to y and y relative to
x allows one to calculate the
instantaneous rate of change of z
relative to x
as a product of those two rates of
change
simply the product of those two
so here's a good one
if a car travels twice as fast as
bicycle and the bicycle is four times as
fast as walking man
then the car travels two times four
eight times as fast as demand
and so this makes it very clear that the
correct thing to do sort of
is to multiply
so
cars twice as fast as bicycle and
bicycle is four times as fast as man
so the car will be eight times as fast
as the man and so we can take these
intermediate rates of change if you will
and multiply them together
and that justifies the
chain rule intuitively so have a look at
chain rule about here really what it
means for us is there's a very simple
recipe for deriving what we want which
is dl by dc
and what we have so far
is we know
want
and we know
what is the
impact of d on l so we know d l by
d d the derivative of l with respect to
d d we know that that's negative two
and now because of this local
reasoning that we've done here we know
dd by d
c
so how does c impact d and in
particular this is a plus node so the
local derivative is simply 1.0 it's very
simple
and so
the chain rule tells us that dl by dc
going through this intermediate variable
will just be simply d l by
d
times
dd by dc
that's chain rule
so this is identical to what's happening
here
except
z is rl
y is our d and x is rc
so we literally just have to multiply
these
and because
these local derivatives like dd by dc
are just one
we basically just copy over dl by dd
because this is just times one
so what does it do so because dl by dd
is negative two what is dl by dc
well it's the local gradient 1.0 times
dl by dd which is negative two
so literally what a plus node does you
can look at it that way is it literally
just routes the gradient
because the plus nodes local derivatives
are just one and so in the chain rule
one times
dl by dd
is um
is uh is just dl by dd and so that
derivative just gets routed to both c
and to e in this case
so basically um we have that that grad
or let's start with c since that's the
one we looked at
is
negative two times one
negative two
and in the same way by symmetry e that
grad will be negative two that's the
claim so we can set those
we can redraw
and you see how we just assign negative
to negative two so this backpropagating
signal which is carrying the information
of like what is the derivative of l with
respect to all the intermediate nodes
we can imagine it almost like flowing
backwards through the graph and a plus
node will simply distribute the
derivative to all the leaf nodes sorry
to all the children nodes of it
so this is the claim and now let's
verify it so let me remove the plus h
here from before
and now instead what we're going to do
is we're going to increment c so c dot
data will be credited by h
and when i run this we expect to see
negative 2
negative 2. and then of course for e
so e dot data plus equals h and we
expect to see negative 2.
simple
so those are the derivatives of these
internal nodes
and now we're going to recurse our way
backwards again
and we're again going to apply the chain
rule so here we go our second
application of chain rule and we will
apply it all the way through the graph
we just happen to only have one more
node remaining
we have that d l
by d e
as we have just calculated is negative
two so we know that
so we know the derivative of l with
respect to e
and now we want dl
by
da
right
and the chain rule is telling us that
that's just dl by de
negative 2
times the local gradient so what is the
local gradient basically d e
by d a
we have to look at that
so i'm a little times node
inside a massive graph
and i only know that i did a times b and
i produced an e
so now what is d e by d a and d e by d b
that's the only thing that i sort of
know about that's my local gradient
so
because we have that e's a times b we're
asking what is d e by d a
and of course we just did that here we
had a
times so i'm not going to rederive it
but if you want to differentiate this
with respect to a you'll just get b
right the value of b
which in this case is negative 3.0
so
basically we have that dl by da
well let me just do it right here we
have that a dot grad and we are applying
chain rule here
is d l by d e which we see here is
negative two
times
what is d e by d a
it's the value of b which is negative 3.
that's it
and then we have b grad is again dl by
de
which is negative 2
just the same way
times
what is d e by d
um db
is the value of a which is 2.2.0
as the value of a
so these are our claimed derivatives
let's
redraw
and we see here that
a dot grad turns out to be 6 because
that is negative 2 times negative 3
and b dot grad is negative 4
times sorry is negative 2 times 2 which
is negative 4.
so those are our claims let's delete
this and let's verify them
we have
a here a dot data plus equals h
so the claim is that
a dot grad is six
let's verify
six
and we have beta data
plus equals h
so nudging b by h
and looking at what happens
we claim it's negative four
and indeed it's negative four plus minus
again float oddness
um
and uh
that's it this
that was the manual
back propagation
uh all the way from here to all the leaf
nodes and we've done it piece by piece
and really all we've done is as you saw
we iterated through all the nodes one by
one and locally applied the chain rule
we always know what is the derivative of
l with respect to this little output and
then we look at how this output was
produced this output was produced
through some operation and we have the
pointers to the children nodes of this
operation
and so in this little operation we know
what the local derivatives are and we
just multiply them onto the derivative
always
so we just go through and recursively
multiply on the local derivatives and
that's what back propagation is is just
a recursive application of chain rule
backwards through the computation graph
let's see this power in action just very
briefly what we're going to do is we're
going to
nudge our inputs to try to make l go up
so in particular what we're doing is we
want a.data we're going to change it
and if we want l to go up that means we
just have to go in the direction of the
gradient so
a
should increase in the direction of
gradient by like some small step amount
this is the step size
and we don't just want this for ba but
also for b
also for c
also for f
those are
leaf nodes which we usually have control
over
and if we nudge in direction of the
gradient we expect a positive influence
on l
so we expect l to go up
positively
so it should become less negative it
should go up to say negative you know
six or something like that
uh it's hard to tell exactly and we'd
have to rewrite the forward pass so let
me just um
do that here
um
this would be the forward pass f would
be unchanged this is effectively the
forward pass and now if we print l.data
we expect because we nudged all the
values all the inputs in the rational
gradient we expected a less negative l
we expect it to go up
so maybe it's negative six or so let's
see what happens
okay negative seven
and uh this is basically one step of an
optimization that we'll end up running
and really does gradient just give us
some power because we know how to
influence the final outcome and this
will be extremely useful for training
knowledge as well as you'll see
so now i would like to do one more uh
example of manual backpropagation using
a bit more complex and uh useful example
we are going to back propagate through a
neuron
so
we want to eventually build up neural
networks and in the simplest case these
are multilateral perceptrons as they're
called so this is a two layer neural net
and it's got these hidden layers made up
of neurons and these neurons are fully
connected to each other
now biologically neurons are very
complicated devices but we have very
simple mathematical models of them
and so this is a very simple
mathematical model of a neuron you have
some inputs axis
and then you have these synapses that
have weights on them so
the w's are weights
and then
the synapse interacts with the input to
this neuron multiplicatively so what
flows to the cell body
of this neuron is w times x
but there's multiple inputs so there's
many w times x's flowing into the cell
body
the cell body then has also like some
bias
so this is kind of like the
inert innate sort of trigger happiness
of this neuron so this bias can make it
a bit more trigger happy or a bit less
trigger happy regardless of the input
but basically we're taking all the w
times x
of all the inputs adding the bias and
then we take it through an activation
function
and this activation function is usually
some kind of a squashing function
like a sigmoid or 10h or something like
that so as an example
we're going to use the 10h in this
example
numpy has a
np.10h
so
we can call it on a range
and we can plot it
this is the 10h function and you see
that the inputs as they come in
get squashed on the y coordinate here so
um
right at zero we're going to get exactly
zero and then as you go more positive in
the input
then you'll see that the function will
only go up to one and then plateau out
and so if you pass in very positive
inputs we're gonna cap it smoothly at
one and on the negative side we're gonna
cap it smoothly to negative one
so that's 10h
and that's the squashing function or an
activation function and what comes out
of this neuron is just the activation
function applied to the dot product of
the weights and the
inputs
so let's
write one out
um
i'm going to copy paste because
i don't want to type too much
but okay so here we have the inputs
x1 x2 so this is a two-dimensional
neuron so two inputs are going to come
in
these are thought out as the weights of
this neuron
weights w1 w2 and these weights again
are the synaptic strengths for each
input
and this is the bias of the neuron
b
and now we want to do is according to
this model we need to multiply x1 times
w1
and x2 times w2
and then we need to add bias on top of
it
and it gets a little messy here but all
we are trying to do is x1 w1 plus x2 w2
plus b
and these are multiply here
except i'm doing it in small steps so
that we actually have pointers to all
these intermediate nodes so we have x1
w1 variable x times x2 w2 variable and
i'm also labeling them
so n is now
the cell body raw
raw
activation without
the activation function for now
and this should be enough to basically
plot it so draw dot of n
gives us x1 times w1 x2 times w2
being added
then the bias gets added on top of this
and this n
is this sum
so we're now going to take it through an
activation function
and let's say we use the 10h
so that we produce the output
so what we'd like to do here is we'd
like to do the output and i'll call it o
is um
n dot 10h
okay but we haven't yet written the 10h
now the reason that we need to implement
another 10h function here is that
tanh is a
hyperbolic function and we've only so
far implemented a plus and the times and
you can't make a 10h out of just pluses
and times
you also need exponentiation so 10h is
this kind of a formula here
you can use either one of these and you
see that there's exponentiation involved
which we have not implemented yet for
our low value node here so we're not
going to be able to produce 10h yet and
we have to go back up and implement
something like it
now one option here
is we could actually implement um
exponentiation
right and we could return the x of a
value instead of a 10h of a value
because if we had x then we have
everything else that we need so um
because we know how to add and we know
how to
um
we know how to add and we know how to
multiply so we'd be able to create 10h
if we knew how to x
but for the purposes of this example i
specifically wanted to
show you
that we don't necessarily need to have
the most atomic pieces
in
um
in this value object we can actually
like create functions at arbitrary
points of abstraction they can be
complicated functions but they can be
also very very simple functions like a
plus and it's totally up to us the only
thing that matters is that we know how
to differentiate through any one
function so we take some inputs and we
make an output the only thing that
matters it can be arbitrarily complex
function as long as you know how to
create the local derivative if you know
the local derivative of how the inputs
impact the output then that's all you
need so we're going to cluster up
all of this expression and we're not
going to break it down to its atomic
pieces we're just going to directly
implement tanh
so let's do that
depth nh
and then out will be a value
of
and we need this expression here so
um
let me actually
copy paste
let's grab n which is a cell.theta
and then this
i believe is the tan h
math.x of
two
no n
n minus one over
two n plus one
maybe i can call this x
just so that it matches exactly
okay and now
this will be t
and uh children of this node there's
just one child
and i'm wrapping it in a tuple so this
is a tuple of one object just self
and here the name of this operation will
be 10h
and we're going to return that
okay
so now valley should be implementing 10h
and now we can scroll all the way down
here
and we can actually do n.10 h and that's
going to return the tanhd
output of n
and now we should be able to draw it out
of o not of n
so let's see how that worked
there we go
n went through 10 h
to produce this output
so now tan h is a
sort of
our little micro grad supported node
here as an operation
and as long as we know the derivative of
10h
then we'll be able to back propagate
through it now let's see this 10h in
action currently it's not squashing too
much because the input to it is pretty
low so if the bias was increased to say
eight
then we'll see that what's flowing into
the 10h now is
two
and 10h is squashing it to 0.96 so we're
already hitting the tail of this 10h and
it will sort of smoothly go up to 1 and
then plateau out over there
okay so now i'm going to do something
slightly strange i'm going to change
this bias from 8 to this number
6.88 etc
and i'm going to do this for specific
reasons because we're about to start
back propagation
and i want to make sure that our numbers
come out nice they're not like very
crazy numbers they're nice numbers that
we can sort of understand in our head
let me also add a pose label
o is short for output here
so that's zero
okay so
0.88 flows into 10 h comes out 0.7 so on
so now we're going to do back
propagation and we're going to fill in
all the gradients
so what is the derivative o with respect
to
all the
inputs here and of course in the typical
neural network setting what we really
care about the most is the derivative of
these neurons on the weights
specifically the w2 and w1 because those
are the weights that we're going to be
changing part of the optimization
and the other thing that we have to
remember is here we have only a single
neuron but in the neural natives
typically have many neurons and they're
connected
so this is only like a one small neuron
a piece of a much bigger puzzle and
eventually there's a loss function that
sort of measures the accuracy of the
neural net and we're back propagating
with respect to that accuracy and trying
to increase it
so let's start off by propagation here
in the end
what is the derivative of o with respect
to o the base case sort of we know
always is that the gradient is just 1.0
so let me fill it in
and then let me
split out
the drawing function
here
and then here cell
clear this output here okay
so now when we draw o we'll see that oh
that grad is one
so now we're going to back propagate
through the tan h
so to back propagate through 10h we need
to know the local derivative of 10h
so if we have that
o is 10 h of
n
then what is d o by d n
now what you could do is you could come
here and you could take this expression
and you could
do your calculus derivative taking
um and that would work but we can also
just scroll down wikipedia here
into a section that hopefully tells us
that derivative uh
d by dx of 10 h of x is
any of these i like this one 1 minus 10
h square of x
so this is 1 minus 10 h
of x squared
so basically what this is saying is that
d o by d n
is
1 minus 10 h
of n
squared
and we already have 10 h of n that's
just o
so it's one minus o squared
so o is the output here so the output is
this number
data
is this number
and then
what this is saying is that do by dn is
1 minus
this squared so
one minus of that data squared
is 0.5 conveniently
so the local derivative of this 10 h
operation here is 0.5
and
so that would be d o by d n
so
we can fill in that in that grad
is 0.5 we'll just fill in
so this is exactly 0.5 one half
so now we're going to continue the back
propagation
this is 0.5 and this is a plus node
so how is backprop going to what is that
going to do here
and if you remember our previous example
a plus is just a distributor of gradient
so this gradient will simply flow to
both of these equally and that's because
the local derivative of this operation
is one for every one of its nodes so 1
times 0.5 is 0.5
so therefore we know that
this node here which we called this
its grad is just 0.5
and we know that b dot grad is also 0.5
so let's set those and let's draw
so 0.5
continuing we have another plus
0.5 again we'll just distribute it so
0.5 will flow to both of these
so we can set
theirs
x2w2 as well that grad is 0.5
and let's redraw pluses are my favorite
uh operations to back propagate through
because
it's very simple
so now it's flowing into these
expressions is 0.5 and so really again
keep in mind what the derivative is
telling us at every point in time along
here this is saying that
if we want the output of this neuron to
increase
then
the influence on these expressions is
positive on the output both of them are
positive
contribution to the output
so now back propagating to x2 and w2
first
this is a times node so we know that the
local derivative is you know the other
term
so if we want to calculate x2.grad
then
can you think through what it's going to
be
so x2.grad will be
w2.data
times this x2w2
by grad right
and
w2.grad will be
x2 that data times x2w2.grad
right so that's the local piece of chain
rule
let's set them and let's redraw
so here we see that the gradient on our
weight 2 is 0 because x2 data was 0
right but x2 will have the gradient 0.5
because data here was 1.
and so what's interesting here right is
because the input x2 was 0 then because
of the way the times works
of course this gradient will be zero and
think about intuitively why that is
derivative always tells us the influence
of
this on the final output if i wiggle w2
how is the output changing
it's not changing because we're
multiplying by zero
so because it's not changing there's no
derivative and zero is the correct
answer
because we're
squashing it at zero
and let's do it here point five should
come here and flow through this times
and so we'll have that x1.grad is
can you think through a little bit what
what
this should be
the local derivative of times
with respect to x1 is going to be w1
so w1 is data times
x1 w1 dot grad
and w1.grad will be x1.data times
x1 w2 w1 with graph
let's see what those came out to be
so this is 0.5 so this would be negative
1.5 and this would be 1.
and we've back propagated through this
expression these are the actual final
derivatives so if we want this neuron's
output to increase
we know that what's necessary is that
w2 we have no gradient w2 doesn't
actually matter to this neuron right now
but this neuron this weight should uh go
up
so if this weight goes up then this
neuron's output would have gone up and
proportionally because the gradient is
one okay so doing the back propagation
manually is obviously ridiculous so we
are now going to put an end to this
suffering and we're going to see how we
can implement uh the backward pass a bit
more automatically we're not going to be
doing all of it manually out here
it's now pretty obvious to us by example
how these pluses and times are back
property ingredients so let's go up to
the value
object and we're going to start
codifying what we've seen
in the examples below
so we're going to do this by storing a
special cell dot backward
and underscore backward and this will be
a function which is going to do that
little piece of chain rule at each
little node that compute that took
inputs and produced output uh we're
going to store
how we are going to chain the the
outputs gradient into the inputs
gradients
so by default
this will be a function
that uh doesn't do anything
so um
and you can also see that here in the
value in micrograb
so
with this backward function by default
doesn't do anything
this is an empty function
and that would be sort of the case for
example for a leaf node for leaf node
there's nothing to do
but now if when we're creating these out
values these out values are an addition
of self and other
and so we will want to sell set
outs backward to be
the function that propagates the
gradient
so
let's define what should happen
and we're going to store it in a closure
let's define what should happen when we
call
outs grad
for in addition
our job is to take
outs grad and propagate it into self's
grad and other grad so basically we want
to sell self.grad to something
and we want to set others.grad to
something
okay
and the way we saw below how chain rule
works we want to take the local
derivative times
the
sort of global derivative i should call
it which is the derivative of the final
output of the expression with respect to
out's data
with respect to out
so
the local derivative of self in an
addition is 1.0
so it's just 1.0 times
outs grad
that's the chain rule
and others.grad will be 1.0 times
outgrad
and what you basically what you're
seeing here is that outscrad
will simply be copied onto selfs grad
and others grad as we saw happens for an
addition operation
so we're going to later call this
function to propagate the gradient
having done an addition
let's now do multiplication we're going
to also define that backward
and we're going to set its backward to
be backward
and we want to chain outgrad into
self.grad
and others.grad
and this will be a little piece of chain
rule for multiplication
so we'll have
so what should this be
can you think through
so what is the local derivative
here the local derivative was
others.data
and then
oops others.data and the times of that
grad that's channel
and here we have self.data times of that
grad
that's what we've been doing
and finally here for 10 h
left backward
and then we want to set out backwards to
be just backward
and here we need to
back propagate we have out that grad and
we want to chain it into self.grad
and salt.grad will be
the local derivative of this operation
that we've done here which is 10h
and so we saw that the local the
gradient is 1 minus the tan h of x
squared which here is t
that's the local derivative because
that's t is the output of this 10 h so 1
minus t squared is the local derivative
and then gradient um
has to be multiplied because of the
chain rule
so outgrad is chained through the local
gradient into salt.grad
and that should be basically it so we're
going to redefine our value node
we're going to swing all the way down
here
and we're going to
redefine
our expression
make sure that all the grads are zero
okay
but now we don't have to do this
manually anymore
we are going to basically be calling the
dot backward in the right order
so
first we want to call os
dot backwards
so o was the outcome of 10h
right so calling all that those who's
backward
will be
this function this is what it will do
now we have to be careful because
there's a times out.grad
and out.grad remember is initialized to
zero
so here we see grad zero so as a base
case we need to set both.grad to 1.0
to initialize this with 1
and then once this is 1 we can call oda
backward
and what that should do is it should
propagate this grad through 10h
so the local derivative times
the global derivative which is
initialized at one so
this should
um
a dope
so i thought about redoing it but i
figured i should just leave the error in
here because it's pretty funny why is
anti-object not callable
uh it's because
i screwed up we're trying to save these
functions so this is correct
this here
we don't want to call the function
because that returns none these
functions return none we just want to
store the function
so let me redefine the value object
and then we're going to come back in
redefine the expression draw a dot
everything is great o dot grad is one
o dot grad is one and now
now this should work of course
okay so all that backward should
this grant should now be 0.5 if we
redraw and if everything went correctly
0.5 yay
okay so now we need to call ns.grad
and it's not awkward sorry
ends backward
so that seems to have worked
so instead backward routed the gradient
to both of these so this is looking
great
now we could of course called uh called
b grad
beat up backwards sorry
what's gonna happen
well b doesn't have it backward b is
backward
because b is a leaf node
b's backward is by initialization the
empty function
so nothing would happen but we can call
call it on it
but when we call
this one
it's backward
then we expect this 0.5 to get further
routed
right so there we go 0.5.5
and then finally
we want to call
it here on x2 w2
and on x1 w1
do both of those
and there we go
so we get 0 0.5 negative 1.5 and 1
exactly as we did before but now
we've done it through
calling that backward um
sort of manually
so we have the lamp one last piece to
get rid of which is us calling
underscore backward manually so let's
think through what we are actually doing
um
we've laid out a mathematical expression
and now we're trying to go backwards
through that expression
um so going backwards through the
expression just means that we never want
to call a dot backward for any node
before
we've done a sort of um everything after
it
so we have to do everything after it
before we're ever going to call that
backward on any one node we have to get
all of its full dependencies everything
that it depends on has to
propagate to it before we can continue
back propagation so this ordering of
graphs can be achieved using something
called topological sort
so topological sort
is basically a laying out of a graph
such that all the edges go only from
left to right basically
so here we have a graph it's a directory
a cyclic graph a dag
and this is two different topological
orders of it i believe where basically
you'll see that it's laying out of the
notes such that all the edges go only
one way from left to right
and implementing topological sort you
can look in wikipedia and so on i'm not
going to go through it in detail
but basically this is what builds a
topological graph
we maintain a set of visited nodes and
then we are
going through starting at some root node
which for us is o that's where we want
to start the topological sort
and starting at o we go through all of
its children and we need to lay them out
from left to right
and basically this starts at o
if it's not visited then it marks it as
visited and then it iterates through all
of its children
and calls build topological on them
and then uh after it's gone through all
the children it adds itself
so basically
this node that we're going to call it on
like say o is only going to add itself
to the topo list after all of the
children have been processed and that's
how this function is guaranteeing
that you're only going to be in the list
once all your children are in the list
and that's the invariant that is being
maintained so if we built upon o and
then inspect this list
we're going to see that it ordered our
value objects
and the last one
is the value of 0.707 which is the
output
so this is o and then this is n
and then all the other nodes get laid
out before it
so that builds the topological graph and
really what we're doing now is we're
just calling dot underscore backward on
all of the nodes in a topological order
so if we just reset the gradients
they're all zero
what did we do
we started by
setting o dot grad
to b1
that's the base case
then we built the topological order
and then we went for node
in
reversed
of topo
now
in in the reverse order because this
list goes from
you know we need to go through it in
reversed order
so starting at o
note that backward
and this should be
it
there we go
those are the correct derivatives
finally we are going to hide this
functionality
so i'm going to
copy this and we're going to hide it
inside the valley class because we don't
want to have all that code lying around
so instead of an underscore backward
we're now going to define an actual
backward so that's backward without the
underscore
and that's going to do all the stuff
that we just arrived
so let me just clean this up a little
bit so
we're first going to
build a topological graph
starting at self
so build topo of self
will populate the topological order into
the topo list which is a local variable
then we set self.grad to be one
and then for each node in the reversed
list so starting at us and going to all
the children
underscore backward
and
that should be it so
save
come down here
redefine
[Music]
okay all the grands are zero
and now what we can do is oh that
backward without the underscore
and
there we go
and that's uh that's back propagation
place for one neuron
now we shouldn't be too happy with
ourselves actually because we have a bad
bug um and we have not surfaced the bug
because of some specific conditions that
we are we have to think about right now
so here's the simplest case that shows
the bug
say i create a single node a
and then i create a b that is a plus a
and then i called backward
so what's going to happen is a is 3
and then a b is a plus a so there's two
arrows on top of each other here
then we can see that b is of course the
forward pass works
b is just
a plus a which is six
but the gradient here is not actually
correct
that we calculate it automatically
and that's because
um
of course uh
just doing calculus in your head the
derivative of b with respect to a
should be uh two
one plus one
it's not one
intuitively what's happening here right
so b is the result of a plus a and then
we call backward on it
so let's go up and see what that does
um
b is a result of addition
so out as
b and then when we called backward what
happened is
self.grad was set
to one
and then other that grad was set to one
but because we're doing a plus a
self and other are actually the exact
same object
so we are overriding the gradient we are
setting it to one and then we are
setting it again to one and that's why
it stays
at one
so that's a problem
there's another way to see this in a
little bit more complicated expression
so here we have
a and b
and then uh d will be the multiplication
of the two and e will be the addition of
the two
and
then we multiply e times d to get f and
then we called fda backward
and these gradients if you check will be
incorrect
so fundamentally what's happening here
again is
basically we're going to see an issue
anytime we use a variable more than once
until now in these expressions above
every variable is used exactly once so
we didn't see the issue
but here if a variable is used more than
once what's going to happen during
backward pass we're backpropagating from
f to e to d so far so good but now
equals it backward and it deposits its
gradients to a and b but then we come
back to d
and call backward and it overwrites
those gradients at a and b
so that's obviously a problem
and the solution here if you look at
the multivariate case of the chain rule
and its generalization there
the solution there is basically that we
have to accumulate these gradients these
gradients add
and so instead of setting those
gradients
we can simply do plus equals we need to
accumulate those gradients
plus equals plus equals
plus equals
plus equals
and this will be okay remember because
we are initializing them at zero so they
start at zero
and then any
contribution
that flows backwards
will simply add
so now if we redefine
this one
because the plus equals this now works
because a.grad started at zero and we
called beta backward we deposit one and
then we deposit one again and now this
is two which is correct
and here this will also work and we'll
get correct gradients
because when we call eta backward we
will deposit the gradients from this
branch and then we get to back into
detail backward it will deposit its own
gradients and then those gradients
simply add on top of each other and so
we just accumulate those gradients and
that fixes the issue okay now before we
move on let me actually do a bit of
cleanup here and delete some of these
some of this intermediate work so
we're not gonna need any of this now
that we've derived all of it
um
we are going to keep this because i want
to come back to it
delete the 10h
delete our morning example
delete the step
delete this keep the code that draws
and then delete this example
and leave behind only the definition of
value
and now let's come back to this
non-linearity here that we implemented
the tanh now i told you that we could
have broken down 10h into its explicit
atoms in terms of other expressions if
we had the x function so if you remember
tan h is defined like this and we chose
to develop tan h as a single function
and we can do that because we know its
derivative and we can back propagate
through it
but we can also break down tan h into
and express it as a function of x and i
would like to do that now because i want
to prove to you that you get all the
same results and all those ingredients
but also because it forces us to
implement a few more expressions it
forces us to do exponentiation addition
subtraction division and things like
that and i think it's a good exercise to
go through a few more of these
okay so let's scroll up
to the definition of value
and here one thing that we currently
can't do is we can do like a value of
say 2.0
but we can't do you know here for
example we want to add constant one and
we can't do something like this
and we can't do it because it says
object has no attribute data that's
because a plus one comes right here to
add
and then other is the integer one and
then here python is trying to access
one.data and that's not a thing and
that's because basically one is not a
value object and we only have addition
for value objects so as a matter of
convenience so that we can create
expressions like this and make them make
sense
we can simply do something like this
basically
we let other alone if other is an
instance of value but if it's not an
instance of value we're going to assume
that it's a number like an integer float
and we're going to simply wrap it in in
value and then other will just become
value of other and then other will have
a data attribute and this should work so
if i just say this predefined value then
this should work
there we go okay now let's do the exact
same thing for multiply because we can't
do something like this
again
for the exact same reason so we just
have to go to mole and if other is
not a value then let's wrap it in value
let's redefine value and now this works
now here's a kind of unfortunate and not
obvious part a times two works we saw
that but two times a is that gonna work
you'd expect it to right but actually it
will not
and the reason it won't is because
python doesn't know
like when when you do a times two
basically um so a times two python will
go and it will basically do something
like a dot mul
of two that's basically what it will
call but to it 2 times a is the same as
2 dot mol of a
and it doesn't 2 can't multiply
value and so it's really confused about
that
so instead what happens is in python the
way this works is you are free to define
something called the r mold
and our mole
is kind of like a fallback so if python
can't do 2 times a it will check if um
if by any chance a knows how to multiply
two and that will be called into our
mole
so because python can't do two times a
it will check is there an our mole in
value and because there is it will now
call that
and what we'll do here is we will swap
the order of the operands so basically
two times a will redirect to armel and
our mole will basically call a times two
and that's how that will work
so
redefining now with armor two times a
becomes four okay now looking at the
other elements that we still need we
need to know how to exponentiate and how
to divide so let's first the explanation
to the exponentiation part we're going
to introduce
a single
function x here
and x is going to mirror 10h in the
sense that it's a simple single function
that transforms a single scalar value
and outputs a single scalar value
so we pop out the python number we use
math.x to exponentiate it create a new
value object
everything that we've seen before the
tricky part of course is how do you
propagate through e to the x
and
so here you can potentially pause the
video and think about what should go
here
okay so basically we need to know what
is the local derivative of e to the x so
d by d x of e to the x is famously just
e to the x and we've already just
calculated e to the x and it's inside
out that data so we can do up that data
times
and
out that grad that's the chain rule
so we're just chaining on to the current
running grad
and this is what the expression looks
like it looks a little confusing but
this is what it is and that's the
exponentiation
so redefining we should now be able to
call a.x
and
hopefully the backward pass works as
well okay and the last thing we'd like
to do of course is we'd like to be able
to divide
now
i actually will implement something
slightly more powerful than division
because division is just a special case
of something a bit more powerful
so in particular just by rearranging
if we have some kind of a b equals
value of 4.0 here we'd like to basically
be able to do a divide b and we'd like
this to be able to give us 0.5
now division actually can be reshuffled
as follows if we have a divide b that's
actually the same as a multiplying one
over b
and that's the same as a multiplying b
to the power of negative one
and so what i'd like to do instead is i
basically like to implement the
operation of x to the k for some
constant uh k so it's an integer or a
float um and we would like to be able to
differentiate this and then as a special
case uh negative one will be division
and so i'm doing that just because uh
it's more general and um yeah you might
as well do it that way so basically what
i'm saying is we can redefine
uh division
which we will put here somewhere
yeah we can put it here somewhere what
i'm saying is that we can redefine
division so self-divide other
can actually be rewritten as self times
other to the power of negative one
and now
a value raised to the power of negative
one we have now defined that
so
here's
so we need to implement the pow function
where am i going to put the power
function maybe here somewhere
this is the skeleton for it
so this function will be called when we
try to raise a value to some power and
other will be that power
now i'd like to make sure that other is
only an int or a float usually other is
some kind of a different value object
but here other will be forced to be an
end or a float otherwise the math
won't work for
for or try to achieve in the specific
case that would be a different
derivative expression if we wanted other
to be a value
so here we create the output value which
is just uh you know this data raised to
the power of other and other here could
be for example negative one that's what
we are hoping to achieve
and then uh this is the backwards stub
and this is the fun part which is what
is the uh chain rule expression here for
back for um
back propagating through the power
function where the power is to the power
of some kind of a constant
so this is the exercise and maybe pause
the video here and see if you can figure
it out yourself as to what we should put
here
okay so
you can actually go here and look at
derivative rules as an example and we
see lots of derivatives that you can
hopefully know from calculus in
particular what we're looking for is the
power rule
because that's telling us that if we're
trying to take d by dx of x to the n
which is what we're doing here
then that is just n times x to the n
minus 1
right
okay
so
that's telling us about the local
derivative of this power operation
so all we want here
basically n is now other
and self.data is x
and so this now becomes
other which is n times
self.data
which is now a python in torah float
it's not a valley object we're accessing
the data attribute
raised
to the power of other minus one or n
minus one
i can put brackets around this but this
doesn't matter because
power takes precedence over multiply and
python so that would have been okay
and that's the local derivative only but
now we have to chain it and we change
just simply by multiplying by output
grad that's chain rule
and this should technically work
and we're going to find out soon but now
if we
do this this should now work
and we get 0.5 so the forward pass works
but does the backward pass work and i
realize that we actually also have to
know how to subtract so
right now a minus b will not work
to make it work we need one more
piece of code here
and
basically this is the
subtraction and the way we're going to
implement subtraction is we're going to
implement it by addition of a negation
and then to implement negation we're
gonna multiply by negative one so just
again using the stuff we've already
built and just um expressing it in terms
of what we have and a minus b is now
working okay so now let's scroll again
to this expression here for this neuron
and let's just
compute the backward pass here once
we've defined o
and let's draw it
so here's the gradients for all these
leaf nodes for this two-dimensional
neuron that has a 10h that we've seen
before so now what i'd like to do is i'd
like to break up this 10h
into this expression here
so let me copy paste this
here
and now instead of we'll preserve the
label
and we will change how we define o
so in particular we're going to
implement this formula here
so we need e to the 2x
minus 1 over e to the x plus 1. so e to
the 2x we need to take 2 times n and we
need to exponentiate it that's e to the
two x and then because we're using it
twice let's create an intermediate
variable e
and then define o as
e plus one over
e minus one over e plus one
e minus one over e plus one
and that should be it and then we should
be able to draw that of o
so now before i run this what do we
expect to see
number one we're expecting to see a much
longer
graph here because we've broken up 10h
into a bunch of other operations
but those operations are mathematically
equivalent and so what we're expecting
to see is number one the same
result here so the forward pass works
and number two because of that
mathematical equivalence we expect to
see the same backward pass and the same
gradients on these leaf nodes so these
gradients should be identical
so let's run this
so number one let's verify that instead
of a single 10h node we have now x and
we have plus we have times negative one
uh this is the division
and we end up with the same forward pass
here
and then the gradients we have to be
careful because they're in slightly
different order potentially the
gradients for w2x2 should be 0 and 0.5
w2 and x2 are 0 and 0.5
and w1 x1 are 1 and negative 1.5
1 and negative 1.5
so that means that both our forward
passes and backward passes were correct
because this turned out to be equivalent
to
10h before
and so the reason i wanted to go through
this exercise is number one we got to
practice a few more operations and uh
writing more backwards passes and number
two i wanted to illustrate the point
that
the um
the level at which you implement your
operations is totally up to you you can
implement backward passes for tiny
expressions like a single individual
plus or a single times
or you can implement them for say
10h
which is a kind of a potentially you can
see it as a composite operation because
it's made up of all these more atomic
operations but really all of this is
kind of like a fake concept all that
matters is we have some kind of inputs
and some kind of an output and this
output is a function of the inputs in
some way and as long as you can do
forward pass and the backward pass of
that little operation it doesn't matter
what that operation is
and how composite it is
if you can write the local gradients you
can chain the gradient and you can
continue back propagation so the design
of what those functions are is
completely up to you
so now i would like to show you how you
can do the exact same thing by using a
modern deep neural network library like
for example pytorch which i've roughly
modeled micrograd
by
and so
pytorch is something you would use in
production and i'll show you how you can
do the exact same thing but in pytorch
api so i'm just going to copy paste it
in and walk you through it a little bit
this is what it looks like
so we're going to import pi torch and
then we need to define these
value objects like we have here
now micrograd is a scalar valued
engine so we only have scalar values
like 2.0 but in pi torch everything is
based around tensors and like i
mentioned tensors are just n-dimensional
arrays of scalars
so that's why things get a little bit
more complicated here i just need a
scalar value to tensor a tensor with
just a single element
but by default when you work with
pytorch you would use um
more complicated tensors like this so if
i import pytorch
then i can create tensors like this and
this tensor for example is a two by
three array
of scalar
scalars
in a single compact representation so we
can check its shape we see that it's a
two by three array
and so on
so this is usually what you would work
with um in the actual libraries so here
i'm creating
a tensor that has only a single element
2.0
and then i'm casting it to be double
because python is by default using
double precision for its floating point
numbers so i'd like everything to be
identical by default the data type of
these tensors will be float32 so it's
only using a single precision float so
i'm casting it to double
so that we have float64 just like in
python
so i'm casting to double and then we get
something similar to value of two the
next thing i have to do is because these
are leaf nodes by default pytorch
assumes that they do not require
gradients so i need to explicitly say
that all of these nodes require
gradients
okay so this is going to construct
scalar valued one element tensors
make sure that fighters knows that they
require gradients now by default these
are set to false by the way because of
efficiency reasons because usually you
would not want gradients for leaf nodes
like the inputs to the network and this
is just trying to be efficient in the
most common cases
so once we've defined all of our values
in python we can perform arithmetic just
like we can here in microgradlend so
this will just work and then there's a
torch.10h also
and when we get back is a tensor again
and we can
just like in micrograd it's got a data
attribute and it's got grant attributes
so these tensor objects just like in
micrograd have a dot data and a dot grad
and
the only difference here is that we need
to call it that item because otherwise
um pi torch
that item basically takes
a single tensor of one element and it
just returns that element stripping out
the tensor
so let me just run this and hopefully we
are going to get this is going to print
the forward pass
which is 0.707
and this will be the gradients which
hopefully are
0.5 0 negative 1.5 and 1.
so if we just run this
there we go
0.7 so the forward pass agrees and then
point five zero negative one point five
and one
so pi torch agrees with us
and just to show you here basically o
here's a tensor with a single element
and it's a double
and we can call that item on it to just
get the single number out
so that's what item does and o is a
tensor object like i mentioned and it's
got a backward function just like we've
implemented
and then all of these also have a dot
graph so like x2 for example in the grad
and it's a tensor and we can pop out the
individual number with that actin
so basically
torches torch can do what we did in
micrograph is a special case when your
tensors are all single element tensors
but the big deal with pytorch is that
everything is significantly more
efficient because we are working with
these tensor objects and we can do lots
of operations in parallel on all of
these tensors
but otherwise what we've built very much
agrees with the api of pytorch
okay so now that we have some machinery
to build out pretty complicated
mathematical expressions we can also
start building out neural nets and as i
mentioned neural nets are just a
specific class of mathematical
expressions
so we're going to start building out a
neural net piece by piece and eventually
we'll build out a two-layer multi-layer
layer perceptron as it's called and i'll
show you exactly what that means
let's start with a single individual
neuron we've implemented one here but
here i'm going to implement one that
also subscribes to the pytorch api in
how it designs its neural network
modules
so just like we saw that we can like
match the api of pytorch
on the auto grad side we're going to try
to do that on the neural network modules
so here's class neuron
and just for the sake of efficiency i'm
going to copy paste some sections that
are relatively straightforward
so the constructor will take
number of inputs to this neuron which is
how many inputs come to a neuron so this
one for example has three inputs
and then it's going to create a weight
there is some random number between
negative one and one for every one of
those inputs
and a bias that controls the overall
trigger happiness of this neuron
and then we're going to implement a def
underscore underscore call
of self and x some input x
and really what we don't do here is w
times x plus b
where w times x here is a dot product
specifically
now if you haven't seen
call
let me just return 0.0 here for now the
way this works now is we can have an x
which is say like 2.0 3.0 then we can
initialize a neuron that is
two-dimensional
because these are two numbers and then
we can feed those two numbers into that
neuron to get an output
and so when you use this notation n of x
python will use call
so currently call just return 0.0
now we'd like to actually do the forward
pass of this neuron instead
so we're going to do here first is we
need to basically multiply all of the
elements of w with all of the elements
of x pairwise we need to multiply them
so the first thing we're going to do is
we're going to zip up
celta w and x
and in python zip takes two iterators
and it creates a new iterator that
iterates over the tuples of the
corresponding entries
so for example just to show you we can
print this list
and still return 0.0 here
sorry
so we see that these w's are paired up
with the x's w with x
and now what we want to do is
for w i x i in
we want to multiply w times
w wi times x i
and then we want to sum all of that
together
to come up with an activation
and add also subnet b on top
so that's the raw activation and then of
course we need to pass that through a
non-linearity so what we're going to be
returning is act.10h
and here's out
so
now we see that we are getting some
outputs and we get a different output
from a neuron each time because we are
initializing different weights and by
and biases
and then to be a bit more efficient here
actually sum by the way takes a second
optional parameter which is the start
and by default the start is zero so
these elements of this sum will be added
on top of zero to begin with but
actually we can just start with cell dot
b
and then we just have an expression like
this
and then the generator expression here
must be parenthesized in python
there we go
yep so now we can forward a single
neuron next up we're going to define a
layer of neurons so here we have a
schematic for a mlb
so we see that these mlps each layer
this is one layer has actually a number
of neurons and they're not connected to
each other but all of them are fully
connected to the input
so what is a layer of neurons it's just
it's just a set of neurons evaluated
independently
so
in the interest of time i'm going to do
something fairly straightforward here
it's um
literally a layer is just a list of
neurons
and then how many neurons do we have we
take that as an input argument here how
many neurons do you want in your layer
number of outputs in this layer
and so we just initialize completely
independent neurons with this given
dimensionality and when we call on it we
just independently
evaluate them so now instead of a neuron
we can make a layer of neurons they are
two-dimensional neurons and let's have
three of them
and now we see that we have three
independent evaluations of three
different neurons
right
okay finally let's complete this picture
and define an entire multi-layer
perceptron or mlp
and as we can see here in an mlp these
layers just feed into each other
sequentially
so let's come here and i'm just going to
copy the code here in interest of time
so an mlp is very similar
we're taking the number of inputs
as before but now instead of taking a
single n out which is number of neurons
in a single layer we're going to take a
list of an outs and this list defines
the sizes of all the layers that we want
in our mlp
so here we just put them all together
and then iterate over consecutive pairs
of these sizes and create layer objects
for them
and then in the call function we are
just calling them sequentially so that's
an mlp really
and let's actually re-implement this
picture so we want three input neurons
and then two layers of four and an
output unit
so
we want
a three-dimensional input say this is an
example input we want three inputs into
two layers of four and one output
and this of course is an mlp
and there we go that's a forward pass of
an mlp
to make this a little bit nicer you see
how we have just a single element but
it's wrapped in a list because layer
always returns lists
circle for convenience
return outs at zero if len out is
exactly a single element
else return fullest
and this will allow us to just get a
single value out at the last layer that
only has a single neuron
and finally we should be able to draw
dot of n of x
and
as you might imagine
these expressions are now getting
relatively involved
so this is an entire mlp that we're
defining now
all the way until a single output
okay
and so obviously you would never
differentiate on pen and paper these
expressions but with micrograd we will
be able to back propagate all the way
through this
and back propagate
into
these weights of all these neurons so
let's see how that works okay so let's
create ourselves a very simple
example data set here
so this data set has four examples
and so we have four possible
inputs into the neural net
and we have four desired targets so we'd
like the neural net to assign
or output 1.0 when it's fed this example
negative one when it's fed these
examples and one when it's fed this
example so it's a very simple binary
classifier neural net basically that we
would like here
now let's think what the neural net
currently thinks about these four
examples we can just get their
predictions
um basically we can just call n of x for
x in axis
and then we can
so these are the outputs of the neural
net on those four examples
so
the first one is 0.91 but we'd like it
to be one so we should push this one
higher this one we want to be higher
this one says 0.88 and we want this to
be negative one
this is 0.8 we want it to be negative
one
and this one is 0.8 we want it to be one
so how do we make the neural net and how
do we tune the weights
to
better predict the desired targets
and the trick used in deep learning to
achieve this is to
calculate a single number that somehow
measures the total performance of your
neural net and we call this single
number the loss
so the loss
first
is is a single number that we're going
to define that basically measures how
well the neural net is performing right
now we have the intuitive sense that
it's not performing very well because
we're not very much close to this
so the loss will be high and we'll want
to minimize the loss
so in particular in this case what we're
going to do is we're going to implement
the mean squared error loss
so this is doing is we're going to
basically iterate um
for y ground truth
and y output in zip of um
wise and white red so we're going to
pair up the
ground truths with the predictions
and this zip iterates over tuples of
them
and for each
y ground truth and y output we're going
to subtract them
and square them
so let's first see what these losses are
these are individual loss components
and so basically for each
one of the four
we are taking the prediction and the
ground truth we are subtracting them and
squaring them
so because
this one is so close to its target 0.91
is almost one
subtracting them gives a very small
number
so here we would get like a negative
point one and then squaring it
just makes sure
that regardless of whether we are more
negative or more positive we always get
a positive
number instead of squaring we should we
could also take for example the absolute
value we need to discard the sign
and so you see that the expression is
ranged so that you only get zero exactly
when y out is equal to y ground truth
when those two are equal so your
prediction is exactly the target you are
going to get zero
and if your prediction is not the target
you are going to get some other number
so here for example we are way off and
so that's why the loss is quite high
and the more off we are the greater the
loss will be
so we don't want high loss we want low
loss
and so the final loss here will be just
the sum
of all of these
numbers
so you see that this should be zero
roughly plus zero roughly
but plus
seven
so loss should be about seven
here
and now we want to minimize the loss we
want the loss to be low
because if loss is low
then every one of the predictions is
equal to its target
so the loss the lowest it can be is zero
and the greater it is the worse off the
neural net is predicting
so now of course if we do lost that
backward
something magical happened when i hit
enter
and the magical thing of course that
happened is that we can look at
end.layers.neuron and that layers at say
like the the first layer
that neurons at zero
because remember that mlp has the layers
which is a list
and each layer has a neurons which is a
list and that gives us an individual
neuron
and then it's got some weights
and so we can for example look at the
weights at zero
um
oops it's not called weights it's called
w
and that's a value but now this value
also has a groud because of the backward
pass
and so we see that because this gradient
here on this particular weight of this
particular neuron of this particular
layer is negative
we see that its influence on the loss is
also negative so slightly increasing
this particular weight of this neuron of
this layer would make the loss go down
and we actually have this information
for every single one of our neurons and
all their parameters actually it's worth
looking at also the draw dot loss by the
way
so previously we looked at the draw dot
of a single neural neuron forward pass
and that was already a large expression
but what is this expression we actually
forwarded
every one of those four examples and
then we have the loss on top of them
with the mean squared error
and so this is a really massive graph
because this graph that we've built up
now
oh my gosh this graph that we've built
up now
which is kind of excessive it's
excessive because it has four forward
passes of a neural net for every one of
the examples and then it has the loss on
top
and it ends with the value of the loss
which was 7.12
and this loss will now back propagate
through all the four forward passes all
the way through just every single
intermediate value of the neural net
all the way back to of course the
parameters of the weights which are the
input
so these weight parameters here are
inputs to this neural net
and
these numbers here these scalars are
inputs to the neural net
so if we went around here
we'll probably find
some of these examples this 1.0
potentially maybe this 1.0 or you know
some of the others and you'll see that
they all have gradients as well
the thing is these gradients on the
input data are not that useful to us
and that's because the input data seems
to be not changeable it's it's a given
to the problem and so it's a fixed input
we're not going to be changing it or
messing with it even though we do have
gradients for it
but some of these gradients here
will be for the neural network
parameters the ws and the bs and those
we of course we want to change
okay so now we're going to want some
convenience code to gather up all of the
parameters of the neural net so that we
can operate on all of them
simultaneously and every one of them we
will nudge a tiny amount
based on the gradient information
so let's collect the parameters of the
neural net all in one array
so let's create a parameters of self
that just
returns celta w which is a list
concatenated with
a list of self.b
so this will just return a list
list plus list just you know gives you a
list
so that's parameters of neuron and i'm
calling it this way because also pi
torch has a parameters on every single
and in module
and uh it does exactly what we're doing
here it just returns the
parameter tensors for us as the
parameter scalars
now layer is also a module so it will
have parameters
itself
and basically what we want to do here is
something like this like
params is here and then for
neuron in salt out neurons
we want to get neuron.parameters
and we want to params.extend
right so these are the parameters of
this neuron and then we want to put them
on top of params so params dot extend
of peace
and then we want to return brands
so this is way too much code so actually
there's a way to simplify this which is
return
p
for neuron in self
neurons
for
p in neuron dot parameters
so it's a single list comprehension in
python you can sort of nest them like
this and you can um
then create
uh the desired
array so this is these are identical
we can take this out
and then let's do the same here
def parameters
self
and return
a parameter for layer in self dot layers
for
p in layer dot parameters
and that should be good
now let me pop out this so
we don't re-initialize our network
because we need to re-initialize
our
okay so unfortunately we will have to
probably re-initialize the network
because we just add functionality
because this class of course we i want
to get all the and that parameters but
that's not going to work because this is
the old class
okay
so unfortunately we do have to
reinitialize the network which will
change some of the numbers
but let me do that so that we pick up
the new api we can now do in the
parameters
and these are all the weights and biases
inside the entire neural net
so in total this mlp has 41 parameters
and
now we'll be able to change them
if we recalculate the loss here we see
that unfortunately we have slightly
different
predictions and slightly different laws
but that's okay
okay so we see that this neurons
gradient is slightly negative we can
also look at its data right now
which is 0.85 so this is the current
value of this neuron and this is its
gradient on the loss
so what we want to do now is we want to
iterate for every p in
n dot parameters so for all the 41
parameters in this neural net
we actually want to change p data
slightly
according to the gradient information
okay so
dot dot to do here
but this will be basically a tiny update
in this gradient descent scheme in
gradient descent we are thinking of the
gradient as a vector pointing in the
direction
of
increased
loss
and so
in gradient descent we are modifying
p data
by a small step size in the direction of
the gradient so the step size as an
example could be like a very small
number like 0.01 is the step size times
p dot grad
right
but we have to think through some of the
signs here
so uh
in particular working with this specific
example here
we see that if we just left it like this
then this neuron's value
would be currently increased by a tiny
amount of the gradient
the grain is negative so this value of
this neuron would go slightly down it
would become like 0.8 you know four or
something like that
but if this neuron's value goes lower
that would actually
increase the loss
that's because
the derivative of this neuron is
negative so increasing
this makes the loss go down so
increasing it is what we want to do
instead of decreasing it so basically
what we're missing here is we're
actually missing a negative sign
and again this other interpretation
and that's because we want to minimize
the loss we don't want to maximize the
loss we want to decrease it
and the other interpretation as i
mentioned is you can think of the
gradient vector
so basically just the vector of all the
gradients
as pointing in the direction of
increasing
the loss but then we want to decrease it
so we actually want to go in the
opposite direction
and so you can convince yourself that
this sort of plug does the right thing
here with the negative because we want
to minimize the loss
so if we nudge all the parameters by
tiny amount
then we'll see that
this data will have changed a little bit
so now this neuron
is a tiny amount greater
value so 0.854 went to 0.857
and that's a good thing because slightly
increasing this neuron
uh
data makes the loss go down according to
the gradient and so the correct thing
has happened sign wise
and so now what we would expect of
course is that
because we've changed all these
parameters we expect that the loss
should have gone down a bit
so we want to re-evaluate the loss let
me basically
this is just a data definition that
hasn't changed but the forward pass here
of the network we can recalculate
and actually let me do it outside here
so that we can compare the two loss
values
so here if i recalculate the loss
we'd expect the new loss now to be
slightly lower than this number so
hopefully what we're getting now is a
tiny bit lower than 4.84
4.36
okay and remember the way we've arranged
this is that low loss means that our
predictions are matching the targets so
our predictions now are probably
slightly closer to the
targets and now all we have to do is we
have to iterate this process
so again um we've done the forward pass
and this is the loss
now we can lost that backward
let me take these out and we can do a
step size
and now we should have a slightly lower
loss 4.36 goes to 3.9
and okay so
we've done the forward pass here's the
backward pass
nudge
and now the loss is 3.66
3.47
and you get the idea we just continue
doing this and this is uh gradient
descent we're just iteratively doing
forward pass backward pass update
forward pass backward pass update and
the neural net is improving its
predictions
so here if we look at why pred now
like red
we see that um
this value should be getting closer to
one
so this value should be getting more
positive these should be getting more
negative and this one should be also
getting more positive so if we just
iterate this
a few more times
actually we may be able to afford go to
go a bit faster let's try a slightly
higher learning rate
oops okay there we go so now we're at
0.31
if you go too fast by the way if you try
to make it too big of a step you may
actually overstep
it's overconfidence because again
remember we don't actually know exactly
about the loss function the loss
function has all kinds of structure and
we only know about the very local
dependence of all these parameters on
the loss but if we step too far
we may step into you know a part of the
loss that is completely different
and that can destabilize training and
make your loss actually blow up even
so the loss is now 0.04 so actually the
predictions should be really quite close
let's take a look
so you see how this is almost one
almost negative one almost one we can
continue going
uh so
yep backward
update
oops there we go so we went way too fast
and um
we actually overstepped
so we got two uh too eager where are we
now oops
okay
seven e negative nine so this is very
very low loss
and the predictions
are basically perfect
so somehow we
basically we were doing way too big
updates and we briefly exploded but then
somehow we ended up getting into a
really good spot so usually this
learning rate and the tuning of it is a
subtle art you want to set your learning
rate if it's too low you're going to
take way too long to converge but if
it's too high the whole thing gets
unstable and you might actually even
explode the loss
depending on your loss function
so finding the step size to be just
right it's it's a pretty subtle art
sometimes when you're using sort of
vanilla gradient descent
but we happen to get into a good spot we
can look at
n-dot parameters
so this is the setting of weights and
biases
that makes our network
predict
the desired targets
very very close
and
basically we've successfully trained
neural net
okay let's make this a tiny bit more
respectable and implement an actual
training loop and what that looks like
so this is the data definition that
stays this is the forward pass
um so
for uh k in range you know we're going
to
take a bunch of steps
first you do the forward pass
we validate the loss
let's re-initialize the neural net from
scratch
and here's the data
and we first do before pass then we do
the backward pass
and then we do an update that's gradient
descent
and then we should be able to iterate
this and we should be able to print the
current step
the current loss um let's just print the
sort of
number of the loss
and
that should be it
and then the learning rate 0.01 is a
little too small 0.1 we saw is like a
little bit dangerously too high let's go
somewhere in between
and we'll optimize this for
not 10 steps but let's go for say 20
steps
let me erase all of this junk
and uh let's run the optimization
and you see how we've actually converged
slower in a more controlled manner and
got to a loss that is very low
so
i expect white bread to be quite good
there we go
um
and
that's it
okay so this is kind of embarrassing but
we actually have a really terrible bug
in here and it's a subtle bug and it's a
very common bug and i can't believe i've
done it for the 20th time in my life
especially on camera and i could have
reshot the whole thing but i think it's
pretty funny and you know you get to
appreciate a bit what um working with
neural nets maybe
is like sometimes
we are guilty of
come bug i've actually tweeted
the most common neural net mistakes a
long time ago now
uh and
i'm not really
gonna explain any of these except for we
are guilty of number three you forgot to
zero grad
before that backward what is that
basically what's happening and it's a
subtle bug and i'm not sure if you saw
it
is that
all of these
weights here have a dot data and a dot
grad
and that grad starts at zero
and then we do backward and we fill in
the gradients
and then we do an update on the data but
we don't flush the grad
it stays there
so when we do the second
forward pass and we do backward again
remember that all the backward
operations do a plus equals on the grad
and so these gradients just
add up and they never get reset to zero
so basically we didn't zero grad so
here's how we zero grad before
backward
we need to iterate over all the
parameters
and we need to make sure that p dot grad
is set to zero
we need to reset it to zero just like it
is in the constructor
so remember all the way here for all
these value nodes grad is reset to zero
and then all these backward passes do a
plus equals from that grad
but we need to make sure that
we reset these graphs to zero so that
when we do backward
all of them start at zero and the actual
backward pass accumulates um
the loss derivatives into the grads
so this is zero grad in pytorch
and uh
we will slightly get we'll get a
slightly different optimization let's
reset the neural net
the data is the same this is now i think
correct
and we get a much more
you know we get a much more
slower descent
we still end up with pretty good results
and we can continue this a bit more
to get down lower
and lower
and lower
yeah
so the only reason that the previous
thing worked it's extremely buggy um the
only reason that worked is that
this is a very very simple problem
and it's very easy for this neural net
to fit this data
and so the grads ended up accumulating
and it effectively gave us a massive
step size and it made us converge
extremely fast
but basically now we have to do more
steps to get to very low values of loss
and get wipe red to be really good we
can try to
step a bit greater
yeah we're gonna get closer and closer
to one minus one and one
so
working with neural nets is sometimes
tricky because
uh
you may have lots of bugs in the code
and uh your network might actually work
just like ours worked
but chances are is that if we had a more
complex problem then actually this bug
would have made us not optimize the loss
very well and we were only able to get
away with it because
the problem is very simple
so let's now bring everything together
and summarize what we learned
what are neural nets neural nets are
these mathematical expressions
fairly simple mathematical expressions
in the case of multi-layer perceptron
that take
input as the data and they take input
the weights and the parameters of the
neural net mathematical expression for
the forward pass followed by a loss
function and the loss function tries to
measure the accuracy of the predictions
and usually the loss will be low when
your predictions are matching your
targets or where the network is
basically behaving well so we we
manipulate the loss function so that
when the loss is low the network is
doing what you want it to do on your
problem
and then we backward the loss
use backpropagation to get the gradient
and then we know how to tune all the
parameters to decrease the loss locally
but then we have to iterate that process
many times in what's called the gradient
descent
so we simply follow the gradient
information and that minimizes the loss
and the loss is arranged so that when
the loss is minimized the network is
doing what you want it to do
and yeah so we just have a blob of
neural stuff and we can make it do
arbitrary things and that's what gives
neural nets their power um
it's you know this is a very tiny
network with 41 parameters
but you can build significantly more
complicated neural nets with billions
at this point almost trillions of
parameters and it's a massive blob of
neural tissue simulated neural tissue
roughly speaking
and you can make it do extremely complex
problems and these neurons then have all
kinds of very fascinating emergent
properties
in
when you try to make them do
significantly hard problems as in the
case of gpt for example
we have massive amounts of text from the
internet and we're trying to get a
neural net to predict to take like a few
words and try to predict the next word
in a sequence that's the learning
problem
and it turns out that when you train
this on all of internet the neural net
actually has like really remarkable
emergent properties but that neural net
would have hundreds of billions of
parameters
but it works on fundamentally the exact
same principles
the neural net of course will be a bit
more complex but otherwise the
value in the gradient is there
and would be identical and the gradient
descent would be there and would be
basically identical but people usually
use slightly different updates this is a
very simple stochastic gradient descent
update
um
and the loss function would not be mean
squared error they would be using
something called the cross-entropy loss
for predicting the next token so there's
a few more details but fundamentally the
neural network setup and neural network
training is identical and pervasive and
now you understand intuitively
how that works under the hood in the
beginning of this video i told you that
by the end of it you would understand
everything in micrograd and then we'd
slowly build it up let me briefly prove
that to you
so i'm going to step through all the
code that is in micrograd as of today
actually potentially some of the code
will change by the time you watch this
video because i intend to continue
developing micrograd
but let's look at what we have so far at
least init.pi is empty when you go to
engine.pi that has the value
everything here you should mostly
recognize so we have the data.grad
attributes we have the backward function
uh we have the previous set of children
and the operation that produced this
value
we have addition multiplication and
raising to a scalar power
we have the relu non-linearity which is
slightly different type of nonlinearity
than 10h that we used in this video
both of them are non-linearities and
notably 10h is not actually present in
micrograd as of right now but i intend
to add it later
with the backward which is identical and
then all of these other operations which
are built up on top of operations here
so values should be very recognizable
except for the non-linearity used in
this video
um there's no massive difference between
relu and 10h and sigmoid and these other
non-linearities they're all roughly
equivalent and can be used in mlps so i
use 10h because it's a bit smoother and
because it's a little bit more
complicated than relu and therefore it's
stressed a little bit more the
local gradients and working with those
derivatives which i thought would be
useful
and then that pi is the neural networks
library as i mentioned so you should
recognize identical implementation of
neuron layer and mlp
notably or not so much
we have a class module here there is a
parent class of all these modules i did
that because there's an nn.module class
in pytorch and so this exactly matches
that api and end.module and pytorch has
also a zero grad which i've refactored
out here
so that's the end of micrograd really
then there's a test
which you'll see
basically creates
two chunks of code one in micrograd and
one in pi torch and we'll make sure that
the forward and the backward pass agree
identically
for a slightly less complicated
expression a slightly more complicated
expression everything
agrees so we agree with pytorch on all
of these operations
and finally there's a demo.ipymb here
and it's a bit more complicated binary
classification demo than the one i
covered in this lecture so we only had a
tiny data set of four examples um here
we have a bit more complicated example
with lots of blue points and lots of red
points and we're trying to again build a
binary classifier to distinguish uh two
dimensional points as red or blue
it's a bit more complicated mlp here
with it's a bigger mlp
the loss is a bit more complicated
because
it supports batches
so because our dataset was so tiny we
always did a forward pass on the entire
data set of four examples but when your
data set is like a million examples what
we usually do in practice is we chair we
basically pick out some random subset we
call that a batch and then we only
process the batch forward backward and
update so we don't have to forward the
entire training set
so this supports batching because
there's a lot more examples here
we do a forward pass the loss is
slightly more different this is a max
margin loss that i implement here
the one that we used was the mean
squared error loss because it's the
simplest one
there's also the binary cross entropy
loss all of them can be used for binary
classification and don't make too much
of a difference in the simple examples
that we looked at so far
there's something called l2
regularization used here this has to do
with generalization of the neural net
and controls the overfitting in machine
learning setting but i did not cover
these concepts and concepts in this
video potentially later
and the training loop you should
recognize so forward backward with zero
grad
and update and so on you'll notice that
in the update here the learning rate is
scaled as a function of number of
iterations and it
shrinks
and this is something called learning
rate decay so in the beginning you have
a high learning rate and as the network
sort of stabilizes near the end you
bring down the learning rate to get some
of the fine details in the end
and in the end we see the decision
surface of the neural net and we see
that it learns to separate out the red
and the blue area based on the data
points
so that's the slightly more complicated
example and then we'll demo that hyper
ymb that you're free to go over
but yeah as of today that is micrograd i
also wanted to show you a little bit of
real stuff so that you get to see how
this is actually implemented in
production grade library like by torch
uh so in particular i wanted to show i
wanted to find and show you the backward
pass for 10h in pytorch so here in
micrograd we see that the backward
password 10h is one minus t square
where t is the output of the tanh of x
times of that grad which is the chain
rule so we're looking for something that
looks like this
now
i went to pytorch um which has an open
source github codebase and uh i looked
through a lot of its code
and honestly i i i spent about 15
minutes and i couldn't find 10h
and that's because these libraries
unfortunately they grow in size and
entropy and if you just search for 10h
you get apparently 2 800 results and 400
and 406 files so i don't know what these
files are doing honestly
and why there are so many mentions of
10h but unfortunately these libraries
are quite complex they're meant to be
used not really inspected um
eventually i did stumble on someone
who tries to change the 10 h backward
code for some reason
and someone here pointed to the cpu
kernel and the kuda kernel for 10 inch
backward
so this so basically depends on if
you're using pi torch on a cpu device or
on a gpu which these are different
devices and i haven't covered this but
this is the 10 h backwards kernel
for uh cpu
and the reason it's so large is that
number one this is like if you're using
a complex type which we haven't even
talked about if you're using a specific
data type of b-float 16 which we haven't
talked about
and then if you're not then this is the
kernel and deep here we see something
that resembles our backward pass so they
have a times one minus
b square uh so this b
b here must be the output of the 10h and
this is the health.grad so here we found
it
uh deep inside
pi torch from this location for some
reason inside binaryops kernel when 10h
is not actually a binary op
and then this is the gpu kernel
we're not complex
we're
here and here we go with one line of
code
so we did find it but basically
unfortunately these codepieces are very
large and
micrograd is very very simple but if you
actually want to use real stuff uh
finding the code for it you'll actually
find that difficult
i also wanted to show you a little
example here where pytorch is showing
you how can you can register a new type
of function that you want to add to
pytorch as a lego building block
so here if you want to for example add a
gender polynomial 3
here's how you could do it you will
register it as a class that
subclasses storage.org that function
and then you have to tell pytorch how to
forward your new function
and how to backward through it
so as long as you can do the forward
pass of this little function piece that
you want to add and as long as you know
the the local derivative the local
gradients which are implemented in the
backward pi torch will be able to back
propagate through your function and then
you can use this as a lego block in a
larger lego castle of all the different
lego blocks that pytorch already has
and so that's the only thing you have to
tell pytorch and everything would just
work and you can register new types of
functions
in this way following this example
and that is everything that i wanted to
cover in this lecture
so i hope you enjoyed building out
micrograd with me i hope you find it
interesting insightful
and
yeah i will post a lot of the links
that are related to this video in the
video description below i will also
probably post a link to a discussion
forum
or discussion group where you can ask
questions related to this video and then
i can answer or someone else can answer
your questions and i may also do a
follow-up video that answers some of the
most common questions
but for now that's it i hope you enjoyed
it if you did then please like and
subscribe so that youtube knows to
feature this video to more people
and that's it for now i'll see you later
now here's the problem
we know
dl by
wait what is the problem
and that's everything i wanted to cover
in this lecture
so i hope
you enjoyed us building up microcraft
micro crab
okay now let's do the exact same thing
for multiply because we can't do
something like a times two
oops
i know what happened there
Loading video analysis...