YouTube Video
By Unknown
Summary
## Key takeaways

- **Without non-linearity, networks collapse to one layer**: If you stack linear transformations without activation functions (like ReLU), multiplying weight matrices collapses into a single matrix—depriving hidden layers of any additional representational power and making it impossible to solve non-linear problems. [18:03], [18:49]
- **Use regularization, not network size, to control overfitting**: Rather than treating network size as a hyperparameter to tune, practitioners typically use a larger network than strictly needed and rely on regularization (controlled by lambda) to prevent overfitting and improve generalization. [29:40], [30:45]
- **Backpropagation chains local and upstream gradients**: Each node in a computational graph computes its local gradient (derivative of output with respect to inputs) and multiplies it by the upstream gradient arriving from later layers, producing downstream gradients that propagate backward through the network. [49:16], [52:08]
- **Mini-batch sizes of 32–256 balance speed and accuracy**: Using the entire dataset for gradient computation is prohibitively expensive; sampling 32, 64, 128, or 256 examples (mini-batches) provides a practical trade-off between gradient estimation quality and computational efficiency. [08:19], [08:33]
- **Activation functions must be non-linear and differentiable**: All activation functions share two critical properties: they introduce non-linearity to break the chain of linear operations, and they must be differentiable to enable gradient-based learning during backpropagation. [24:26], [25:03]
- **Hidden neurons learn templates for object parts**: With 100 hidden neurons versus 10 output classes, the middle layer can form 100 templates representing shared parts (e.g., eyes across bird, cat, dog), allowing the network to compose object representations from reusable components rather than learning whole templates for each class. [16:15], [16:56]
Topics Covered
- Visualizing the Neural Network Learning Process
- Why Neural Networks Need Non-Linear Activation Functions
- Build a Neural Network in Under 20 Lines of Python
- Sparse Jacobian Enables Efficient Gradient Computation
- Jacobian Matrix Explodes to 256GB for One Operation
Full Transcript
As you can see on this slide, today, we are going to talk about neural networks and backpropagation.
Which is the process that, in the early years when I was studying this, I often referred to as the magical process that lets neural networks learn from their own mistakes, pretty much like humans but in a more organized fashion,
and with a little bit more math.
So let's dive into the topic.
I'm sure this is going to be exciting.
And this is laying the foundation for the rest of the quarter.
Every single algorithm that we will be discussing in the future without even mentioning is using a form of backpropagation.
And so that's why understanding this lecture and the topics is very important.
OK, in keeping with tradition, let's cover what we've talked about so far.
So I'm sure you now remember what we talked about last time.
We talked about how we form the objective functions, or loss functions, as we call them here.
And then we talked about regularization.
But to do that, we formulated everything through the x, y, defining the pairs and a scoring function.
Which, in this case, we are using a linear scoring function, as you can see, and also defining, ultimately, this loss function.
So this graph that you see on the right is what we drew, showing all the process, the entire process of learning.
There have been some questions, in the last lecture and even before that, about why we are only using the softmax function.
I wanted to reiterate that it's not the only loss function that we have and we use.
It's one of the most widely used in deep learning and building, especially for the task of classification.
But there are so many other options that we use for other tasks, for different tasks.
Even for the task of classification, if you've looked at the slides that we've shared on the website, I included this hinge loss, which used to be called SVM loss, in the reading assignments in lecture two.
So in the slides, we had examples and everything around the topic of hinge loss.
It is also one of those widely used loss functions, especially in the early years of neural networks.
And just to give you a high level understanding of what it is, this is a loss function, that unlike softmax, does not turn the scores into probabilities.
So turning them into probabilities is not the only option.
So we can use other forms. This function encourages the score of-- let me highlight here-- the score of the correct item, denoted s_yi, to be higher than the scores
of all the other items, s_j.
You can see the condition here, creating a value of 0 if the condition is true.
And otherwise, what it does is-- so as I said, it encourages the score of the correct item to be higher than the scores of all other items by at least a margin.
That number one that you see there is the margin that it creates.
And if the condition is violated, the loss increases in proportion to how far the margin is violated.
And this is the visualization of the function.
So it promotes correct scores by penalizing cases where irrelevant items are scored too highly.
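To make this concrete, here is a small numpy sketch of my own (not code from the slides) of the multiclass hinge loss for a single example, with the margin of 1 mentioned above:

```python
import numpy as np

def hinge_loss(scores, y, margin=1.0):
    # Multiclass hinge (SVM) loss for one example:
    # sum over j != y of max(0, s_j - s_y + margin).
    diffs = scores - scores[y] + margin
    diffs[y] = 0.0  # the correct class contributes no loss
    return np.maximum(0.0, diffs).sum()

scores = np.array([3.2, 5.1, -1.7])  # raw class scores, not probabilities
print(hinge_loss(scores, y=0))       # only class 1 violates the margin
```

Note that, unlike softmax, the scores are used directly: nothing is turned into a probability.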
So again, refer to reading assignment in lecture two for examples and to get better understanding of that.
Next, we have talked about general optimization, how to find the best parameters, W, for the neural network.
And in doing so, we talked a little bit about this loss landscape as a large valley, as shown in this image.
And every point on that valley is a different set of weight parameters.
And we wanted to find the set of parameters, W, that minimizes that loss landscape.
We talked about the fact that the key is being able to take the gradient of the loss function, L, with respect to W and use the gradient for optimization in a step-by-step manner.
Which gave us the gradient descent algorithm.
So the weights are basically updated.
Although it's very hard for me to see what I'm pointing to from this distance, I can guess.
So anyway, in order to walk down the loss landscape towards the minimum, we define a step size and take one step of that size
in the negative direction of the gradient.
So this was the gradient descent algorithm.
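The update rule just described can be sketched in a few lines (a toy illustration of mine, minimizing f(w) = w² rather than a real network loss):

```python
def gradient_step(w, grad, step_size):
    # One step of gradient descent: move in the negative gradient direction.
    return w - step_size * grad

# Toy example: minimize f(w) = w^2, whose gradient is 2w; the minimum is w = 0.
w = 5.0
for _ in range(100):
    w = gradient_step(w, grad=2 * w, step_size=0.1)
print(w)  # very close to 0
```

In a real network, `w` is the whole set of weight matrices and `grad` comes from backpropagation, but the step itself looks the same.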
And in order to optimize, we talked about two different approaches, the numerical gradient and the analytical gradient, both of which have pros and cons.
And we discussed that in practice, we derive analytical gradients.
And often, if the math and the implementation are hard, we check our implementations against numerical gradients.
And one of the other challenges we talked about was computing the loss function and its gradients over the entire data set.
So if you have a large data set, it's very expensive to run the loss function and the derivative on the entire data set.
And that's why we talked about the idea of mini batches, using a number of examples sampled from the data set, often maybe 32, 64, or 128, or 256.
And that subsampled data is used for identifying the gradients and then taking the steps towards the minimum.
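A minimal numpy sketch of mini-batch sampling (the dataset here is random placeholder data of my own, just to show the indexing):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 64))   # placeholder dataset: 10,000 examples
y = rng.integers(0, 10, size=10_000)    # placeholder labels for 10 classes

batch_size = 128  # typical choices: 32, 64, 128, or 256
idx = rng.integers(0, X.shape[0], size=batch_size)  # random sample of indices
X_batch, y_batch = X[idx], y[idx]

# The gradient is then estimated on the batch instead of all 10,000 examples.
print(X_batch.shape, y_batch.shape)  # (128, 64) (128,)
```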
And beyond SGD, stochastic gradient descent, we talked about some optimized variants: SGD with momentum, RMSProp, and the Adam optimizer.
And there were a lot of details that I would refer you to the lecture, the third lecture, if you have any specific questions about this.
And then one of the other things that we talked about was the importance of the learning rate and scheduling the learning rate.
And with some of the optimizers, we often try to start with a larger value of the learning rate and then decay it, reducing its value by a factor on a schedule.
And this is normally needed in many optimizers.
But with some of the more recent ones, Adam and its variants, we often do not need to manually or explicitly decrease it, because that behavior is encoded into the optimizer itself.
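One common manual schedule is step decay; here is a toy sketch (the function name and the drop/interval values are my own choices for illustration):

```python
def step_decay(lr0, epoch, drop=0.5, every=30):
    # Start with a larger learning rate, then halve it every `every` epochs.
    return lr0 * drop ** (epoch // every)

print(step_decay(0.1, epoch=0))   # 0.1
print(step_decay(0.1, epoch=30))  # 0.05
print(step_decay(0.1, epoch=90))  # 0.0125
```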
So with that, I want us to get to the topic of neural networks and see how we can actually build neural networks and solve more exciting and harder problems.
So we've so far talked about this function, linear function, of W multiplied by x.
And that is the most basic neural network that could be defined as just one layer.
We will be talking about the layers.
And what I want you to pay attention to here are these dimensions D and C: D is the dimensionality of the input data x, the number of features,
and C is the number of classes, basically the number of outputs, nodes, or neurons we need.
And in order to add a second layer to the neural network, we define a new set of weights, referred to as W2 here,
and we apply those to the output of the previous layer, W1 multiplied by x.
Again, pay attention to the dimensionalities here that we have the C number of outputs and D as the number of input features.
But then we also define H, the number of hidden-layer nodes, or neurons.
That's one point.
The second point is this max function that we'll be coming back to.
And we'll explain what it is and what it means.
What the max operation is doing here is to create a non-linearity between the linear transformations done by W1 and W2.
And this is actually a very, very important part of the process.
I will talk a little bit about the non-linearity but also look at this last part before I forget.
In practice, it's true that we are only writing W and x here, as we talked about in the first and second lectures.
We also incorporate a bias term, just to have a complete framework.
So in practice there is also a bias, but we don't write it here for the sake of simplicity.
Anyways, the max operation is creating the nonlinearity.
And it's actually very important because, as we talked about with linear classifiers in the last few lectures, there are many problems where we can't separate the samples with just one single line.
This was one of those examples: in order to solve this problem with linear functions, we need some sort of nonlinear transformation from the original space to a new space.
And now, in the new space, you see that they are separable, using a line.
So in this case, it's a nonlinear transformation between the input and then the second space, which is mapping the x and y to their polar coordinates, r and theta.
But again, this is just one example.
There are so many other examples, too.
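The polar-coordinates example can be checked numerically. Here is a sketch of mine with two concentric rings, which are not separable by a line in (x, y) but become separable by the radius after the non-linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])  # inner, outer ring
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Original (x, y) space: two concentric rings, not separable by one line.
x = r * np.cos(theta)
y = r * np.sin(theta)

# Non-linear map to polar coordinates: radius r (and angle theta).
r_feat = np.sqrt(x**2 + y**2)
# In the new space, the single line r = 2 separates the classes perfectly.
print(np.all((r_feat > 2.0) == (labels == 1)))  # True
```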
So with this example, let's go back to-- oops, let's go back to our definition of the two layer neural network.
As you've probably seen in the literature outside this class, these types of networks, which depend only on the weights, the inputs, and the layers, with no operations other than these multiplications and activations, are often referred to as fully connected networks, or multi-layer perceptrons, MLPs.
So that's one thing.
And we can actually stack more and more layers to create bigger, more powerful networks.
And in this case, again, pay attention to the dimensionalities of the hidden layers in the middle, which must match from one layer to the next.
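A numpy sketch of the two-layer forward pass with the dimensions discussed (D input features, H hidden neurons, C classes; the small init scale is my own choice):

```python
import numpy as np

D, H, C = 3072, 100, 10  # input features, hidden neurons, output classes
rng = np.random.default_rng(0)
W1 = rng.standard_normal((H, D)) * 0.01  # first layer: maps D -> H
W2 = rng.standard_normal((C, H)) * 0.01  # second layer: maps H -> C
x = rng.standard_normal(D)               # one input vector

h = np.maximum(0, W1 @ x)  # hidden layer, with the max (ReLU) non-linearity
scores = W2 @ h            # C class scores
print(h.shape, scores.shape)  # (100,) (10,)
```

Stacking a third layer just means adding another weight matrix whose input dimension matches H.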
So back to this visual representation of what the neural network is doing.
We talked about this when we had the linear representations, that often what happens is the network, through the weights, is learning some sort of templates.
If you remember last week, we were talking about these templates that are being learned.
Again, I'm saying templates loosely; they are not exact templates.
They are representatives of the images, derived from whatever data the network was trained on.
So these templates in what we discussed last week were generated by these 10 outputs by applying the W's on top of the input neurons.
So with that, now that we have multiple layers, more layers, now we can actually create some more templates.
Now, we have a layer in the middle that can actually create 100 templates, as opposed to just 10 for a linear classifier.
Although we still have those 10 as well.
And again, I'm telling you what this means from a very high-level point of view.
When we have these 100 neurons in the middle, we are giving the network the power to create templates for not entire objects but maybe parts of the object.
For example, the classes that you see here, we had bird, cat, deer, dog, frog, horse, they all have eyes.
So one of those 100 templates could represent a part of an object, like an eye, that is shared between all of the classes.
So, from a high-level point of view, these hidden neurons can form templates for parts.
And when we come back to the topic of visualizing what neural networks learn, we'll uncover more details about what I'm describing right now.
So back to the function max.
We talked about max function, the nonlinearity that is created here.
And in neural network terminology, we call that an activation function.
And it's actually playing a very, very important role, a pivotal role, in building the model, building a neural network.
Let's answer this question that we have on the slide.
What happens if we try to build a neural network without one of these activation functions, let's say, the max function?
This will be our function if I remove the max.
So it would be W2 multiply by W1 by x.
What would happen here?
Yes exactly.
So as you can guess, and as you correctly mentioned, the multiplication of W2 by W1 could easily be replaced with a single matrix, W3.
And then your function becomes just a linear function.
So everything could be lumped together.
So we need some sort of non-linearity in the middle to be able to give us the power to solve non-linear problems. The function that we just talked about is a ReLU.
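You can verify the collapse numerically; in this sketch of mine, composing two linear layers without an activation gives the same result as one merged matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((100, 3072))
W2 = rng.standard_normal((10, 100))
x = rng.standard_normal(3072)

# Without a non-linearity, two layers collapse into one:
W3 = W2 @ W1               # a single (10, 3072) matrix
two_layer = W2 @ (W1 @ x)  # no max() in between
one_layer = W3 @ x
print(np.allclose(two_layer, one_layer))  # True
```

Inserting `np.maximum(0, ...)` between the two multiplications is exactly what breaks this equivalence.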
It's the rectified linear unit.
It's a very popular function, activation function, used in neural networks.
While there are many other variants that have been tested in other, even more modern architectures, one problem ReLU has is that it sometimes creates dead neurons, because it sets everything that is not positive to 0.
So in order to avoid dead neurons, leaky ReLU, with this type of modeling, or ELU, the exponential linear unit, are other options.
ELU is a little bit better because its outputs are closer to zero-centered.
And then there are some newer variations, like GELU, the Gaussian Error Linear Unit.
I've heard it pronounced a couple of different ways, so either could be used.
These are used more often in transformer architectures.
And we also have SiLU, or Swish.
It's the sigmoid linear unit, which is also used in some of the modern CNN architectures.
Google was using it in some variations of their models, including EfficientNet.
Other than these, there are functions like sigmoid and tanh that are also often used as activation functions.
Although they do have a few problems because they do squash values in a narrow range.
And that sometimes results in vanishing gradients.
So we often do not use sigmoid or Tanh in the middle of the neural networks.
They are often used in the later layers, where we want to, for example, binarize the outputs and things like that.
So as I said, ReLU is often a good default choice.
It's very much used in many architectures.
And there are so many variations of the same function that we talked about.
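For reference, several of these activations are one-liners in numpy. These are sketches of mine; the `alpha` values shown are common defaults, not values prescribed by the lecture:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: zero for non-positive inputs.
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # A small slope for negative inputs helps avoid "dead" neurons.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Exponential linear unit: smooth, closer to zero-centered outputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))  # [0. 0. 3.]
```

Note how only the negative branch differs between the three: that branch is what controls dead neurons and zero-centering.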
I want to summarize what we've talked about and then answer some questions.
So we did talk about adding different layers and so on.
But I want to highlight that activation functions are the functions operating within the layers.
And we also have W's, which define the weights mapping between the previous layer and the next layer.
Again, these are fully connected neural networks with very simple implementations.
What we only need is to be able to define an activation function.
And in this example, if you look at it, we have the sigmoid function defined as the activation function, and it is very easy to use.
The first and second layers of hidden values, hidden neurons, are calculated by applying W1 to x, adding the bias, and then applying the activation function.
The same goes for h2, and the output is simply the dot product between W3 and the last layer of hidden values, creating the output layer.
I'll stop here to answer some questions if there are any.
And then we'll continue.
That is a great question.
And the question is, how would we choose for a new problem which of these activation functions to use?
The short answer to your question is yes, it's empirical in most cases.
But we often start with ReLU, or we go with the standard activation functions used for those specific architectures.
As I mentioned, there are activation functions that are often commonly used in CNNs, or in transformers and different architectures.
So we often go with the ones that are tested before.
But yes, it's mostly empirical.
If you're designing a new network for a new problem, then that's one of your choices that you have to make, very much similar to other hyperparameters.
So the question here is, what attribute is common to all of these activation functions, and what does it really do?
I will give you some examples.
And I'll go into some of the details of what these activation functions are doing.
Basically, the main and most important common characteristic is that they create nonlinearity.
We are not using a linear function as the activation.
So creating some sort of nonlinearity is what makes them so important.
And why do we have so many variations?
I told you a little bit about the problems with vanishing gradients.
I told you a little bit about differentiability of the functions.
They should be differentiable because we are using them in networks trained with gradient-based learning.
And sometimes, having properly zero-centered outputs and a smooth function makes networks converge much faster.
So there are so many different factors.
These are the main ones that I talked about, which play an important role in defining, or designing, these functions.
I'll talk a little bit more about it when I go into details of the functions.
In all of the layers, we often use the same activation function.
But as I said, sometimes in the later layers, or the output layer, we use a sigmoid or tanh activation function. But commonly, yes.
And the question was, do we use the same function across the entire network, for all of the neurons?
OK, continuing with what we were talking about, which is the implementation of these models, these neural networks.
So there is a very simple way.
I mean, building a two-layer neural network in Python takes less than 20 lines of code.
Very simple, define our network as I talked about the dimensionalities.
N is the number of samples.
D_in is the dimensionality of the input.
And D_out is the dimensionality of the output.
And h, the number of neurons in the hidden layer.
This part is just creating x and y and randomly initializing the W's.
Then we have the forward pass, which means applying the W's to the inputs, layer by layer, ultimately creating the output, the predicted y's, and finally calculating and outputting the loss value.
After the forward pass, we need an optimization process: a way to calculate the analytical gradients and use them to run gradient descent on W1 and W2, basically taking one step towards the optimal values
of the network.
But this part, calculating the analytical gradient, is the most important part in here that we haven't very much gone into.
So almost the rest of this lecture is about making this work and scale in different settings.
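The slide's code itself isn't reproduced in the transcript, so here is a sketch of mine in the same spirit: a two-layer network with a sigmoid activation, a manual backward pass, and a gradient-descent step. The initialization scale, learning rate, and squared-error loss are my own choices and may differ from the slides.

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10  # samples, input dim, hidden, output dim
rng = np.random.default_rng(0)
x = rng.standard_normal((N, D_in))        # random inputs
y = rng.standard_normal((N, D_out))       # random targets
w1 = rng.standard_normal((D_in, H)) * 0.01
w2 = rng.standard_normal((H, D_out)) * 0.01

learning_rate = 1e-4
losses = []
for t in range(50):
    # Forward pass: sigmoid hidden layer, linear output, squared-error loss.
    h = 1.0 / (1.0 + np.exp(-(x @ w1)))
    y_pred = h @ w2
    loss = np.square(y_pred - y).sum()
    losses.append(loss)

    # Backward pass: chain local gradients with upstream gradients.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T @ grad_y_pred
    grad_h = grad_y_pred @ w2.T
    grad_w1 = x.T @ (grad_h * h * (1.0 - h))  # sigmoid'(z) = h * (1 - h)

    # One gradient-descent step on each weight matrix.
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print(losses[0], "->", losses[-1])  # loss decreases over the 50 steps
```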
So after training such a neural network, depending on how many nodes we use in the hidden layer, you see that we get different patterns of separation between the two classes.
And more neurons often means more capacity to learn more complex functions and better separation of the data points.
If you take a look at this, the pattern I'm showing here is very similar to the one I showed in the second lecture, where we were talking about k-nearest neighbors.
When we had k equal to 1, the one-nearest-neighbor setup, it looked very much like using more neurons here.
So same type of arguments happen here that if we give a lot of capacity to the network, then we will have some overfitting problems. We won't be able to generalize to unseen data.
But there are many different solutions for this as well.
And as a rule of thumb, what I want to highlight for you is to not use the size of the neural network as a regularizer.
We don't often treat the network size as a hyperparameter to fine-tune.
Although, we experiment with different values of the network size and related hyperparameters.
But what we often do is go with a somewhat bigger network than we need.
And then we use regularization, specifically the regularization hyperparameter, to explore the different setups.
So what we often tune is the regularization hyperparameter, and not necessarily the network size itself.
OK, this is the concept of neural networks in a nutshell.
But we have all heard about neural networks and how they relate to biology; there are some biological inspirations.
So I'll talk a little bit about it, but there is a question.
Basically, your question is, why is the model more underfitting when we increase the value of lambda here?
Yes, so just to quickly answer that question, the value of lambda is controlling how much contribution the regularizer should have in the overall loss.
And the larger the contribution of the regularizer (remember that the regularizer was defined on the W's), the more it constrains the W's.
It gives less freedom to the values of the W's.
And less freedom means somewhat more generic boundaries, without those detailed parts of the boundary.
So if you constrain the model too much with the regularizer, you're going to get underfitted decision boundaries like that.
Yes, the right regularizer prevents overfitting.
Again, you are creating a compromise, a balance.
The first part of the loss is about predicting the right output.
The second part only plays with the values of the weights and doesn't care about the outputs anymore.
If you overweight the second part, you're not going to get very good classifiers.
So a regularizer that creates balance is always good.
But nothing is good if you use too much of it.
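The balance being described is just a weighted sum; here is a minimal sketch of mine (the function and variable names are my own):

```python
import numpy as np

def total_loss(data_loss, W, lam):
    # Total loss = data loss + lambda * L2 regularizer on the weights.
    # Larger lambda constrains the W's more, giving simpler boundaries.
    return data_loss + lam * np.sum(W ** 2)

W = np.array([[0.5, -1.0], [2.0, 0.0]])
print(total_loss(1.3, W, lam=0.0))  # 1.3: no constraint on the weights
print(total_loss(1.3, W, lam=0.1))  # larger lambda penalizes large weights
```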
Can you go over again why we would want to choose the regularization rather than the size [INAUDIBLE]?
So there are many different reasons.
One of them is size of the network.
You're building the networks.
You're going to build networks that, sometimes, you have to run for a few days to get results.
So what we often do is start increasing the number of parameters in the network until we see some level of overfitting.
That's when we know the network is capturing the patterns in the data and is now starting to memorize it.
And that's the time that we try to minimize the overfitting by regularizing the network.
So regularization plays an important factor there.
So if we go too high on the number of parameters, on the complexity of the network, that is going to cause a problem.
We rarely do that.
For a new problem, we often start with smaller networks, increase the size afterwards, and correct for overfitting with the regularizer.
For a given problem, how do we know how many neurons we need to solve the problem?
That's based on empirical research and looking at other similar work; there is no one prescription for all.
You have to look at other networks that were trained on similar data and start from that range.
And then you often do a number of experiments yourself to increase or decrease the complexity of the network.
So it is pretty much always bound to exploration.
So your question is, is there any theoretical or foundational work on which activation functions to use and how many layers to use?
There is a lot of research, and there are papers analyzing these, as well as methods for optimizing all of these meta- or hyperparameters of the networks.
We are not going to get into them in detail because, again, a big part of it depends on the data set and the problem you're solving.
And so the best answer to your question is, yes, there are some works out there.
But again, each of those make assumptions that may not be necessarily true for your application, or problem.
So, there are some biological inspirations.
But again, these inspirations are very loose.
If there is a neuroscientist sitting here, or is watching online, do not take all of the examples that I'm giving you, or talking about as the ground truth.
But generally, what happens in a neuron (and this is a visualization of one) is that it has a cell body, which aggregates the impulses carried to it through the dendrites.
Then, through the axon, those impulses are carried away to other neurons.
This is very much similar to what we are doing in our neural networks.
We often have a function that captures the signals, all of the impulses and activations from the previous layer.
In the cell body, that function operates on the inputs, outputs the activations, and passes them to the next layer, the next neuron.
And that's basically why we need some sort of activation function here: to create the impulses, to increase or decrease the values.
So with that, again, there are many differences between biological neurons and our neural networks; biological neurons can be far more complex than the networks we build.
But generally, there are common concepts.
Often, the neural networks that we build are organized into regular patterns.
Those patterns exist because we want better computational efficiency when we implement the networks.
There has been research on creating and optimizing more complex, irregular network structures, but in terms of results, they are roughly comparable with the regular networks that we often build and will be talking about in this class.
I can't warn you enough to be careful with brain analogies and how they can be interpreted.
So there are so many differences.
And I'll just stop here and would be happy to discuss if anybody was interested in the neuroscience aspect of things as well.
So plugging everything in, we did have a scoring function.
This scoring function turns the inputs into scores through some weight vectors, or weight matrices, W.
And what we often use as the loss function for the network takes those scores through either hinge loss, or softmax, or other variations.
And in addition to that, we defined regularizers, which ultimately give us the total loss of the data loss plus regularizer.
And we talked about the fact that in order to optimize W1 and W2, what we need is to take the partial derivatives of L with respect to W1 and W2.
There are so many different details that we have to be aware of.
First, building these functions and then deriving and writing down the gradients is often tedious.
There are lots of matrix calculations, and it takes a lot of work on paper before you can actually implement a neural network.
The other challenge is, what if you want to change the loss slightly?
Then all of the calculations we did on paper are void, and we have to redo the entire thing.
And finally, this becomes intractable and, sometimes, infeasible if the loss function is complex.
So with complex functions, that's going to be even harder.
But there is a better idea, something that is often used in our implementations.
And I'm going to go into a few examples today just to make sure everybody is on the same page and understands these topics.
And that is computational graphs and the idea of backpropagation.
Computational graphs put together all of the operations in the neural network, step by step: they start from the inputs and all of the parameters that are needed, and they produce the loss as the final output, the final layer.
So in this case, we had a loss function, which could be a softmax function, or a hinge loss function.
Whatever it is, it's the loss function, which is added to the regularizer, the function RW.
And R has W as its input.
So these two added together calculate, or create the loss.
And before the loss is calculated, we also need to combine x and W to create the scores.
This node is a multiplication function.
This is actually very useful because most of the neural networks that we build also have graph representations.
And all of these complex functions can be expressed within the same framework.
And then we can use this and build their computation graph, starting from input image or input data.
There are a bunch of weights throughout the network.
And finally, there is the loss function.
And again, this is useful because there are some complex neural networks, like the Neural Turing Machine.
That one is used for temporal and sequential data.
So there is a lot of unrolling of that machine.
And if we had to do all of the gradient work manually by hand, it would be intractable and not feasible.
And that's why when we build this computational graph, the solution to that is backpropagation.
And I want to start with a very simple example.
So we start with a function f of x, y, and z, which is (x plus y) multiplied by z.
And if I draw the computational graph for this function, you see we have an operation, which is the addition operation between x and y.
And then we have a multiplication between that addition of x and y multiplied by z.
So given an input setup of x equal to minus 2, y equal to 5, and z equal to minus 4, we can now make all of the calculations, doing the forward pass, stepping forward through the graph.
The first step is adding x and y, which gives us 3.
And in order to be able to understand the steps and step-by-step, I'm giving a name to it.
So q equals x plus y.
And if I want to calculate the partial derivatives of q with respect to both x and y, it's very simple because we have the formulation relating q to x and y right here.
The derivatives: partial q by partial x equals 1, and partial q by partial y equals 1 as well.
So this is a simple setup.
We know it exists.
So just keep it in the back of our minds.
Then the second operation is f equals q multiplied by z.
Again, since we have this function, it's very easy to write the partial derivatives.
Partial f by q equals z.
And f by z equals q.
So it's swap between z and q.
I'm hoping that everybody knows all of these derivatives from calculus.
If you don't, you should definitely review and remind yourself, because these are very, very important for the rest of the quarter.
What we want and what we need in this setup, and to complete this example of backpropagation, we need the partial derivative of f with respect to x, y, and z.
The way backpropagation computes this is to start at the end of the network and work backward, propagating all of the gradients.
And this is basically a recursive process that will be running.
So what is the derivative of f with respect to f?
It's the derivative of a value with respect to itself, so it's 1.
At the very last step, the derivative of the loss function with respect to itself is always 1.
If I want to backprop, the most immediate one is z.
You can see here that we have z.
And for this one, if I calculate the derivative of f with respect to z, we already have it.
F with respect to z is equal to q.
So whatever the value of q is goes to this as the gradient as well.
Next, we have q.
Q is the next one-- our next one that is directly connected to f.
So this is also easy to compute because we have derivative of f with respect to q.
We have also already calculated that it's equal to z.
Whatever z is, that's the value of derivative here minus 4.
Next, we have y, which is directly before q.
We need the derivative of f with respect to y, but y and f are not directly connected.
And that's where we use the chain rule, where we split the calculation of derivatives with respect to the variable in the middle.
So partial f with respect to y equals partial f with respect to q, times partial q with respect to y.
So this is how the chain rule could be written in this case.
And now, I want to introduce you to two important new terms-- local gradient and upstream gradient.
The upstream gradient is the gradient that arrives at the current node from the end of the network.
The local gradient is the node's own gradient: the derivative of its output with respect to each of its inputs.
So it's the local gradient.
Computing these is actually not too hard, because we already have the value of partial f with respect to q, and we already have partial q with respect to y.
So it's 1 multiplied by z, and the value becomes minus 4.
The same story holds for the other variable, x.
Here again, the local and upstream gradients can be combined with the chain rule, and the result is also minus 4, because in both cases the local gradient with respect to x or y was 1.
So both of them get the same value.
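The whole worked example can be sketched in a few lines of Python (variable names are illustrative; this is just the forward pass followed by the chain-rule multiplications described above):

```python
# Worked example: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # intermediate value q = 3
f = q * z          # output f = -12

# Backward pass: downstream = local gradient * upstream gradient
df_df = 1.0            # gradient of f with respect to itself is always 1
df_dz = q * df_df      # local grad of * w.r.t. z is q  -> 3
df_dq = z * df_df      # local grad of * w.r.t. q is z  -> -4
df_dx = 1.0 * df_dq    # local grad of + w.r.t. x is 1  -> -4
df_dy = 1.0 * df_dq    # local grad of + w.r.t. y is 1  -> -4
```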
So with this computational setup and the computational graph, it becomes very easy to modularize what we want to do.
For every single node in the network, with x and y (or whatever else) as inputs and z as the output, the first thing we need is the local gradients.
We always have the node's function f of x and y, so the gradient of the output with respect to each of the inputs is easy to calculate for every single node.
And what we need to be able to backpropagate is the upstream gradient.
And the backpropagation process gives us this upstream gradient step by step.
So when we are at a given node, the upstream gradient has already been calculated by the later nodes.
And that's what we need.
What we can do after this is just to multiply the upstream gradient with the local gradient and create what now we call downstream gradients.
So the downstream gradients are going to be upstream gradients for the previous layers.
So that's how we calculate that for x, same story when it comes to y.
So this whole process lets us compute everything completely locally, stepping backwards node by node and handing the gradients to the previous nodes so they can continue.
So again, this is one of the most fundamental operations in all of neural networks and many optimization processes involving multiple layers of information.
If I understand the question correctly, you're saying, how can we understand this, intuitively, what the gradients are doing?
So let's take one step back and see why we are here to begin with.
What we needed was to calculate the gradients of the loss function with respect to W1, W2, and the weights in general, so that we can take a step in the direction opposite to the gradients and move toward the optimal loss.
So in order to do that, we need gradient of L loss with respect to everything.
So what we are doing is, we are just moving gradient of L with respect to all variables in the network, back to every single value of the network, without sitting down and writing the function for the entire network.
If the network has 100 layers, we're not going to be sitting down and writing the function for all of the 100 layers separately.
This is how we backpropagate step-by-step to get the values that we need for that optimization process of every single weight that is going to be incorporated in the network.
OK, another example, so this is a little bit more complex: a function of weights and inputs, f equals 1 over 1 plus e to the power of minus the linear combination of x and w.
So there are a bunch of multiplications, additions, a negation, the exp function, and ultimately the 1-over-x function at the end.
So with all of those, let's look at this example that we have, specific values for W0, x0, W1, x1, and W2.
With these given values, we can do the forward pass, calculate every single value that we have in this process.
And just to remind you of the derivative rules we need: for the exp function, the derivative of e to the x with respect to x is e to the x itself.
For multiplication by a constant, the derivative is always that constant.
For 1 over x, the derivative is minus 1 over x squared.
These are, again, things we know from calculus.
And for addition of a constant, the derivative is always 1.
So as I said, at the very start of backpropagation, at the end of the network, the derivative of L with respect to L is always equal to 1.
That's where we start, using the rule for the function 1 over x.
The upstream gradient, as I said, is always 1 at the end, and the local gradient is minus 1 over x squared.
What is the value of x?
Whatever the input is.
So this calculation results in minus 0.53.
So minus 0.53 is the downstream gradient, which defines the upstream gradient for the next one.
And in the next, again, the function here is just the constant addition, where we know that the local gradient equals to 1.
So 1 multiplied by upstream gradient, same value, goes back.
And next step is the exp function.
So for that, again, the upstream, we already have the value.
For the local gradient, it's e to the power of x.
What is x?
It's the input of this step, which is minus 1.
So calculating this will give us minus 0.2.
And this goes back to the next step.
Here, we, again, have a multiplication with a constant number, which defines the local gradient equal to that number, or that constant value, and defining the new gradient, downstream gradient.
And going back, now here we have an addition function, where we are getting two inputs of different values.
And again, if you want to calculate the upstream gradient, it's 0.2.
Already, we have it.
The local gradients will be equal to 1 because it's just an addition between two values.
And an addition, the derivative of x plus y with respect to both x and y is always 1.
So both inputs will be the same.
Then we have the multiplication operations.
Again, we already have the upstream gradient values.
And the local gradients with respect to a multiplication is, if we always have, say, for example, a multiplied by x, the derivative of this with respect to x is always the other variable.
So here, for the first one, it's minus 1, which is the value of x.
And for the second one, it's 2, which is the value of w.
So the local gradient is the other variable, whatever value it has.
So with that, we can calculate everything and then also calculate the ones with respect to W1 and x1.
Again, we made all of these calculations so we can identify how much each weight should change in order to step towards the optimal point.
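A minimal sketch of this node-by-node forward and backward pass in Python. The values of w1, x1, and w2 below are assumptions, chosen to be consistent with the intermediate results quoted in the lecture (the dot product coming out as 1), not values the lecture states:

```python
import math

# Illustrative inputs; w0 and x0 are from the lecture, w1, x1, w2 are
# assumed values consistent with the quoted intermediates (dot product = 1).
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass through each primitive node
dot = w0 * x0 + w1 * x1 + w2   # 1.0
neg = -dot                      # -1.0 (negation node)
e = math.exp(neg)               # ~0.37 (exp node)
denom = 1.0 + e                 # ~1.37 (constant addition)
f = 1.0 / denom                 # ~0.73 (1/x node)

# Backward pass: each downstream grad = local grad * upstream grad
d_denom = (-1.0 / denom**2) * 1.0   # 1/x rule        -> ~-0.53
d_e = 1.0 * d_denom                 # constant add     -> ~-0.53
d_neg = math.exp(neg) * d_e         # exp rule         -> ~-0.20
d_dot = -1.0 * d_neg                # multiply by -1   -> ~0.20
d_w0 = x0 * d_dot                   # swap rule        -> ~-0.20
d_x0 = w0 * d_dot                   # swap rule        -> ~0.39
```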
So this was another example.
There are so many different ways to draw a computational graph.
This was not the only one that I explained.
So we can actually lump all of the functions together and define a sigmoid because this is basically a sigmoid of a linear function.
So the linear function could be here.
And then all of these operations could be defined as sigmoid.
And actually, sigmoid is interesting and very useful because its local gradient can be expressed in terms of the sigmoid itself.
If we do the calculations and simplify, the derivative of sigmoid with respect to x is 1 minus sigmoid of x, multiplied by sigmoid of x.
So it's actually a very useful framework, useful function, and easy.
And to calculate the downstream gradient: the upstream gradient here was 1, and the local gradient is this function with x replaced by the input, which was 1.
Multiplying that by the upstream value of 1 gives 0.2, which is exactly the same value we got before by doing each step separately.
To summarize, there are a few common gradient-flow patterns at the nodes that we can memorize.
The add gate is always a gradient distributor: because of the properties of addition that I explained, each input receives the same gradient that arrived at the gate.
For the multiplication gate, it's a swap function.
Again, I told you, gradient of xy with respect to x is y, with respect to y is x.
So it's a swap.
And then there is the copy gate, where one value is sent to multiple consumers.
In the backward pass, the gradients coming back into the node are simply added together.
And ultimately, there is the max gate, which we actually use quite often; it's very similar to the ReLU function.
Because it takes the max between its inputs, it acts as a gradient router: whichever input had the max value receives the entire upstream gradient, and the other input receives zero.
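The four patterns can be sketched as small helper functions (a hedged illustration with made-up names, not library code), each taking the upstream gradient and returning the downstream gradients:

```python
def add_backward(upstream):
    # Add gate: distributor -- both inputs receive the upstream unchanged.
    return upstream, upstream

def mul_backward(x, y, upstream):
    # Multiply gate: swap -- each input gets the *other* input times upstream.
    return y * upstream, x * upstream

def copy_backward(upstream_a, upstream_b):
    # Copy gate: one value, two consumers -- the gradients add up.
    return upstream_a + upstream_b

def max_backward(x, y, upstream):
    # Max gate: router -- the winning input takes all of the gradient.
    return (upstream, 0.0) if x > y else (0.0, upstream)
```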
So with that, it's very simple to now implement a neural network, forward pass, compute all of the steps.
And then in the backward pass, we start computing the gradients.
In a step-by-step, I explained that the gradient of the loss function with respect to itself is always 1.
And then we start from the end of the network and go up.
You can see here, that we are going up.
So this is the sigmoid function calculating the gradients, then going up.
That was the add gate.
We had another add gate.
And then we had two multiply gates, which basically, very simply, gives us the implementations.
And with this type of formulation, what we can do is modularize every function in the neural network and create forward and backward APIs for every single function that we need.
So in this case, this is a multiplication gate.
Because the backward pass of a multiplication needs access to the inputs, we save them during the forward pass, compute the forward output, and then in the backward pass calculate the gradients.
This means we can write our functions with the forward and backward passes packaged together.
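A minimal sketch of such a forward/backward API for the multiply gate (the class and method names are illustrative, not the actual PyTorch API):

```python
class MultiplyGate:
    """Sketch of the forward/backward API pattern for one node."""

    def forward(self, x, y):
        # Cache the inputs: the backward pass needs them for the swap rule.
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # Downstream gradient = local gradient * upstream gradient.
        dx = self.y * upstream   # local grad w.r.t. x is y
        dy = self.x * upstream   # local grad w.r.t. y is x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)      # -12.0
dx, dy = gate.backward(1.0)        # (-4.0, 3.0)
```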
And this is how PyTorch operators look right now.
If you look at the sigmoid layer, for example, there is the forward pass, although in this particular function it's not implemented directly; the actual implementation lives elsewhere, in PyTorch's C++ code.
And the backward pass of sigmoid computes the same function that we just talked about.
So far, all of the examples that I wanted to cover used scalar values.
But we know that all of these operations can also be implemented in vector or matrix form, so let's expand on that.
We talked about the scalar-to-scalar setting, where the inputs x and output y are scalars, so the derivative is also a scalar: it tells us, if we change x by a small amount, how much the value of y will change.
If we now vectorize, and x is a vector of N elements while y is a scalar, that's a vector-to-scalar derivative, and the derivative is also a vector.
Each element of that vector tells us how much y, the one output value, changes if the corresponding element of x changes by a small amount.
And then there are vector-to-vector settings, where x and y are both vectors, of arbitrary sizes N and M.
In those cases, the derivatives form a matrix, or what we call a Jacobian.
For each element of x, if it changes by a small amount, the corresponding entry of the Jacobian tells us how much each element of y will change.
Again, look at the subscripts here: they are not the same; the two sizes can be different.
And every single element in this Jacobian has a clear meaning.
And here is how we can visualize backprop with vectors: say x, y, and z are vectors of sizes dx, dy, and dz.
The loss L itself is always a scalar, because it's the one value we want to minimize.
But the upstream gradient will be a vector of size dz, the same size as the variable z.
And same story happens when it comes to downstream gradients.
Before going to the downstream gradients, let me tell you a little about the local gradients, which are the gradients of z with respect to x and y.
This is the part where the Jacobians appear, because now the gradients turn into matrices.
So we have two Jacobian matrices here, with sizes given by the size of the input times the size of the output.
And then this results in downstream gradients that are multiplication of upstream and the local gradient.
And there, we get same size as the inputs x itself.
So we will have a vector again here because the input was a vector, same size, in terms of the gradients.
I just mentioned that the gradient of the loss with respect to a variable always has the same dimensionality as the variable itself, as also shown in this slide.
So backprop with vectors was this, just one example here.
Let's say we have a function, which is the max of 0 and x.
That's the ReLU function.
So this is an element-wise function: for each input value, it takes the max with 0.
If the value is positive, it passes through; otherwise it is replaced with 0.
Assume you get some upstream gradients and now, we need to build a Jacobian matrix here.
And this Jacobian matrix, because the operation is element-wise and each output depends only on its own input, is a very sparse matrix: it only has values on the main diagonal.
Those values are either 1, where the input was positive and passed through, or 0, where the input was replaced with 0.
Multiplying it by the upstream gradient gives us the downstream gradient.
And this is how the calculations are done.
As I said, Jacobian here is sparse because in this case, the operation is element wise.
So in the backward pass, instead of calculating that huge, sparse Jacobian matrix, what we do is use this rule-based calculation of the gradient for the max function.
We don't actually store or compute that matrix, because we know how the function operates.
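That rule-based backward pass for ReLU can be sketched with NumPy (assuming a 1-D input; function names are illustrative). No Jacobian is ever materialized; the diagonal structure collapses into an element-wise mask:

```python
import numpy as np

def relu_forward(x):
    # Element-wise max with 0.
    return np.maximum(0.0, x)

def relu_backward(x, upstream):
    # Equivalent to multiplying by the sparse diagonal Jacobian:
    # pass the upstream gradient through where x > 0, zero it elsewhere.
    return upstream * (x > 0)

x = np.array([1.0, -2.0, 3.0, -1.0])
up = np.array([4.0, -1.0, 5.0, 9.0])   # upstream gradients
down = relu_backward(x, up)            # gradient only flows where x > 0
```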
And then this could also be extended to matrices and even tensors.
If the inputs are not vectors but higher-dimensional data, then again the gradients with respect to the variables are the same size as those variables, and calculating the upstream and downstream derivatives works the same way as we discussed and showed earlier for vectors.
When it comes to the local gradients, however, we would get a huge Jacobian, because we have a matrix as the output and a matrix as the input.
The local gradient would have as many entries as the input size times the output size, so it's a huge matrix by itself.
Let me give you an example.
If x and w are the inputs to a matrix-multiplication gate that generates y as the output, calculating the derivative of L with respect to y gives us these Jacobian matrices.
And say we have a mini-batch size of 64 and a dimensionality of 4,096: then that huge Jacobian matrix would take over 256 gigabytes for just one single matrix multiplication.
So in order to simplify this, what we do is, we try to look at the values and how they impact each other.
For example, which parts of y are affected if one element of x changes?
The element of x at row n, column d affects only row n of the output.
This helps us see that we don't need to create the huge Jacobian to calculate these gradients; we can write the backward pass for matrix multiplication in a more efficient, specialized way.
And I'm almost done.
So we answer this question: how much does the element of x at position (n, d) affect the value of y at position (n, m)?
In other words, what should the gradient of y at (n, m) with respect to x at (n, d) be?
Just to remind you, this is a multiplication operation, and in multiply gates the gradient is a swap.
So the answer to this question is a value in W: the gradient of y at (n, m) with respect to x at (n, d) is the element of W in row d and column m.
It's the same swap as before, but now we have to look at the giant matrices and find which specific element it should be.
And then based on that, we can actually replace the entire thing with matrix multiplication and matrix operations.
The gradient of L with respect to x is defined by a simple matrix operation: the upstream gradient multiplied by W transpose.
And the gradient of L with respect to W is x transpose multiplied by the upstream gradient.
Again, there is a swap here: for x, we use the entire W, and for W, we use the entire x.
These formulas make it easy to implement larger, harder operations in the backward passes.
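Assuming x has shape [N, D], w has shape [D, M], and y = x @ w has shape [N, M], the efficient matrix-multiply backward pass can be sketched with NumPy as (function name is illustrative):

```python
import numpy as np

def matmul_backward(x, w, upstream):
    # upstream has shape [N, M], the same as y = x @ w.
    dx = upstream @ w.T    # [N, M] @ [M, D] -> [N, D], same shape as x
    dw = x.T @ upstream    # [D, N] @ [N, M] -> [D, M], same shape as w
    return dx, dw

# Tiny usage example: a 1x2 input times a 2x1 weight matrix.
x = np.array([[1.0, 2.0]])
w = np.array([[3.0], [4.0]])
up = np.array([[1.0]])                 # upstream gradient for the 1x1 output
dx, dw = matmul_backward(x, w, up)
```

Note that the giant Jacobian never appears: both products reuse the forward-pass matrices directly, which is why the backward pass costs about the same as the forward pass.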
All right, we're done.
Just to summarize, we talked today about fully connected neural networks.
We went through all the steps needed for backpropagation, the forward passes, backward passes.
And next session, we will be getting into the topic of convolutional neural networks.
Thank you.