Recurrent Neural Networks (RNNs), Clearly Explained!!!

By StatQuest with Josh Starmer

Summary

## Key takeaways - **RNNs handle variable-length sequential data**: Recurrent Neural Networks are designed to handle sequential data where the amount of input varies, unlike previous neural networks that were fixed to a specific number of inputs. [03:56] - **RNNs use feedback loops for memory**: The key difference in RNNs is the feedback loop, which allows the network to use sequential input values, like stock prices over time, to make predictions by incorporating past information. [04:24], [07:12] - **Unrolling RNNs visualizes sequential processing**: Unrolling an RNN's feedback loop creates a new network with a copy for each input value, making it easier to visualize how both past and present data influence the final prediction. [07:49], [08:14] - **Shared weights are crucial in unrolled RNNs**: Even when unrolled, RNNs share the same weights and biases across all inputs, meaning the number of parameters to train does not increase with more sequential data. [10:39], [10:45] - **Vanishing/Exploding Gradients hinder RNN training**: As RNNs are unrolled further, gradients can either explode (when weights are > 1) or vanish (when weights are < 1), making it difficult to train the network effectively. [11:24], [13:48] - **Gradient explosion amplifies early inputs**: If a weight is set to 2, unrolling the network multiple times exponentially amplifies the initial input value, leading to extremely large gradients that disrupt training. [12:51], [13:15]

Topics Covered

Recurrent Neural Networks handle variable-length sequential data.
Recurrent Neural Networks use feedback loops for sequential memory.
Unrolling RNNs visualizes sequential processing and shared weights.
Shared weights across unrolled RNNs prevent parameter explosion.
Vanishing/Exploding Gradients hinder RNN training with long sequences.

Full Transcript

Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about

recurrent neural networks, and they're going to be clearly explained!

Lightning and Grid are totally cool. Check them out when you've got some time.

NOTE: This StatQuest assumes that you are already familiar with the main ideas behind

neural networks, backpropagation and the ReLU activation function. If not, check

out the 'Quest. ALSO NOTE: Although basic, or vanilla recurrent neural networks are

awesome, they are usually thought of as a stepping stone to understanding fancier

things like Long Short-Term Memory Networks and Transformers, which we will talk

about in future StatQuests. In other words, every Quest worth taking take steps,

and this is the first step. So with that said, let's say Hi to StatSquatch. Hi! And StatSquatch says,

Hello! The other day I bought stock in a company called Get Rich Quick, but the next

day their stock price went down and I lost money. Bummer. So, I was thinking, maybe

we could create a neural network to predict stock prices. Wouldn't that be cool?

That sure would be cool 'Squatch, unfortunately the actual stock market is crazy complicated

and we'd probably both get in a lot of trouble if we offered advice on how to make

money with it, but if we go to that mystical place called StatLand things are much

simpler and there are far fewer lawyers. So let's build a neural network that predicts

stock prices in StatLand. However, first let's just talk about stock market data

in general. When we look at stock prices, they tend to change over time. For example,

the price of this stock went up for four days before going down. Also, the longer

a company has been traded on the stock market, the more data we'll have for it. For

example, we have more time points for the company represented by the blue line then

we have for the company represented by the red line. What that means is, if we want

to use a neural network to predict stock prices, then we need a neural network that

works with different amounts of sequential data. In other words, if we want to predict

the stock price for the Blue Line company on day 10, then we might want to use the

data from all nine of the preceding days. In contrast, if we wanted to predict the

stock price for the Red Line company on day 10, then we would only have data for

the preceding five days. So we need the neural network to be flexible in terms of

how much sequential data we use to make a prediction.

This is a big difference compared to the other neural networks we've looked at in

this series. For example, in Neural Networks Clearly Explained, we examined a neural

network that made predictions using one input value, no more and no less. And if

you saw the StatQuest on neural networks with multiple inputs and outputs, you saw

this neural network that made predictions using two input values, no more and no less.

And in the StatQuest on Deep Learning Image Classification, you saw a neural network

that made a prediction using an image that was six pixels by six pixels, no bigger

and no smaller. However, now we need a neural network that can make a prediction

using the nine values we have for the blue company and make a prediction using the

five values we have for the red company. The good news is that one way to deal with

the problem of having different amounts of input values is to use a Recurrent Neural

Network. Just like the other neural networks that we've seen before, recurrent neural

networks have weights, biases, layers and activation functions. The big difference

is that recurrent neural networks also have feedback loops. And, although this neural

network may look like it only takes a single input value, the feedback loop makes

it possible to use sequential input values, like stock market prices collected over time, to make predictions.

To understand how, exactly, this recurrent neural network can make predictions with

sequential input values, let's run some of StatLand's stock market data through it.

In StatLand, if the price of a stock is low for two days in a row, then, more often

than not, the price remains low on the next day. In other words, if yesterday and

today's stock price is low, then tomorrow's price should also be low. In contrast,

if yesterday's price was low and today's price is medium, then tomorrow's price should

be even higher. And when the price decreases from high to medium, then tomorrow's

price will be even lower. Lastly, if the price stays high for two days in a row,

then the price will be high tomorrow. Now that we see the general trends in stock

prices in StatLand, we can talk about how to run yesterday and today's data through

a recurrent neural network to predict tomorrow's price. The first thing we'll do

is scale the prices so that low equals 0, medium equals 0.5, and high equals 1.

Now let's run the values for yesterday and today through this recurrent neural network

and see if it can correctly predict tomorrow's value. Now, because the recurrent

neural network has a feedback loop, we can enter yesterday and today's values into

the input sequentially. We'll start by plugging yesterday's value into the input.

Now we can do the math just like we would for any other neural network. Beep. Boop.

Beep. Boop. Boop. At this point, the output from the activation function, the y axis

coordinate that we will call Y sub 1, can go two places. First Y sub 1 can go towards

the output. And if we go that way and do the math beep, boop, boop then the output

is the predicted value for today. However, we're not interested in the predicted

value for today because we already have the actual value for today. Instead, we want

to use both yesterday and today's value to predict tomorrow's value. So, for now,

we'll ignore this output, and instead, focus on what happens with this feedback loop.

The key to understanding how the feedback loop works is this summation. The summation

allows us to add Y sub 1 times W sub 2, which is based on yesterday's value, to the

value from today times W sub 1. In other words, the feedback loop allows both yesterday

and today's values to influence the prediction.

Hey, this feedback loop has got me all turned around. Is there an easier way to see how this works?

Yes! There's an easier way to see what's going on. Instead of having to remember which

value is in the loop, and which value is in the input, we can unroll the feedback

loop by making a copy of the neural network for each input value. Now, instead of

pointing the feedback loop to the sum in the first copy, we can point it to the sum

in the second copy. By unrolling the recurrent neural network, we end up with a new

network that has two inputs and two outputs. The first input is for yesterday's value,

and if we do the math straight through to the first output like we did earlier, we

get the predicted value for today. However, as we saw earlier, we can ignore this

output. This second input is for today's value and the connection between the first

activation function and the second summation allows both yesterday and today's values

to influence the final output, which gives us the predicted value for tomorrow. Now,

when we put yesterday's value into the first input and we do the math just like before

beep boop beep boop then we follow the connection from the first activation function

to the summation in the second copy of the neural network. Now we put today's value

into the second input and keep doing the math beep boop beep boop beep boop beep

and that gives us the predicted value for tomorrow, zero,

which is consistent with the original observation. In other words, the recurrent neural

network correctly predicted tomorrow's value.

Likewise,

when we run yesterday and today's values for the other scenarios through the recurrent

neural network we predict a correct values for tomorrow.

This recurrent neural network performs great with two days worth of data, but what if we have three days of data?

When we want to use three days of data to make a prediction about tomorrow's price,

like this, then we just keep unrolling the recurrent neural network until we have

an input for each day of data. Then we plug the values into the inputs, always from

the oldest to the newest. In this case, that means we start by plugging in the value

for the day before yesterday, then we plug in yesterday's value, and then we plug

in today's value. And when we do the math, the last output gives us the prediction

for tomorrow. NOTE: Regardless of how many times we unroll a recurrent neural network,

the weights and biases are shared across every input. In other words, even though

this unrolled network has three inputs the weight, W sub 1, is the same for all three

inputs. And the bias, B sub 1, is also the same for all three inputs. Likewise, all

of the other weights and biases are shared. So, no matter how many times we unroll

a recurrent neural network, we never increase the number of weights and biases that we have to train.

Okay, now that we've talked about what makes basic recurrent neural networks so cool,

let's briefly talk about why they are not used very often. One big problem is that

the more we unroll a recurrent neural network, the harder it is to train. This problem

is called The Vanishing / Exploding Gradient Problem. Which is also known as the "hey wait, where the gradient go?"

problem. In our example, The Vanishing / Exploding Gradient Problem has to do

with the weight along the squiggle that we copy each time we unroll the network. NOTE:

To make it easier to understand the Vanishing / Exploding Gradient Problem, we're

going to ignore the other weights and biases in this network and just focus on W

sub 2. Also, just to remind you when we optimize neural networks with backpropagation,

we first find the derivatives, or gradients, for each parameter. We then plug those

gradients into the gradient descent algorithm to find the parameter values that minimize

a loss function, like the sum of the squared residuals. Bam. Now, even though the

Vanishing / Exploding Gradient Problem starts with Vanishing, we're going to start by showing how a gradient can explode.

In our example, the gradient will explode when we set W sub 2 to any value larger than one.

So let's set W sub 2 equal to 2. Now, the first input value, input sub 1, will be

multiplied by 2 on the first squiggle and then multiplied by 2 on the next squiggle

and again on the next squiggle and again on the last squiggle. In other words, since

we unrolled the recurrent neural network four times,

we multiply the input value by W sub 2, which is 2, raised to the number of times

we unrolled, which is 4. And that means the first input value is amplified 16 times

before it gets to the final copy of the network. Now, if we had 50 sequential days

of stock market data, which to be honest, really isn't that much data, then we would

unroll the network 50 times, and 2 raised to the 50 power is a huge number. And this

huge number is why they call this an Exploding Gradient Problem.

If we tried to train this recurrent neural network with backpropagation, this huge

number would find its way into some of the gradients,

and that would make it hard to take small steps to find the optimal weights and biases.

In other words, in order to find the parameter values that give us the lowest value

for the loss function, we usually want to take relatively small steps. Bam. However,

when the gradient contains a huge number, then we'll end up taking relatively large

steps and instead of finding the optimal parameter we will just bounce around a lot.

Bummer. One way to prevent the Exploding Gradient Problem would be to limit W Sub

2 to values less than 1. However, this results in the Vanishing Gradient Problem. "Hey, wait, where the gradient go?"

To illustrate the Vanishing Gradient Problem, let's set W sub 2 to 0.5. Now, just

like before, we multiply the first input by W sub 2 raised to the number of times

we unroll the network. So if we have 50 sequential input values, that means multiplying

input sub 1 by 0.5 raised to the 50th power and 0.5 raised to the 50th power is a number super close to zero.

Because this number is super close to zero. This is called The Vanishing Gradient Problem.

Now when optimizing a parameter, instead of taking steps that are too large, we end

up taking steps that are too small. And as a result, we end up hitting the maximum

number of steps we are allowed to take before we find the optimal value.

Hey, Josh, these Vanishing / Exploding Gradients are a total bummer. Is there anything we can do about them?

Yes, and we'll talk about a popular solution called Long Short-Term Memory Networks in the next StatQuest.

Now, it's time for some Shameless Self Promotion.

If you want to review statistics and machine learning offline, check out my book.

The StatQuest Illustrated Guide to Machine Learning at statquest.org. It's over 300 pages of total awesomeness.

Hooray! We've made it to the the end of another exciting StatQuest. If you liked this

StatQuest and want to see more, please subscribe. And if you want to support StatQuest,

consider contributing to my patreon campaign, becoming a channel member buying one

or two of my original songs or a t-shirt or a hoodie, or just donate. The links are

in the description below. Alright, until next time Quest on!

Loading...

Loading video analysis...