Recurrent Neural Networks (RNNs), Clearly Explained!!!
By StatQuest with Josh Starmer
Summary
## Key takeaways
- **RNNs handle variable-length sequential data**: Recurrent Neural Networks are designed to handle sequential data where the amount of input varies, unlike previous neural networks that were fixed to a specific number of inputs. [03:56]
- **RNNs use feedback loops for memory**: The key difference in RNNs is the feedback loop, which allows the network to use sequential input values, like stock prices over time, to make predictions by incorporating past information. [04:24], [07:12]
- **Unrolling RNNs visualizes sequential processing**: Unrolling an RNN's feedback loop creates a new network with a copy for each input value, making it easier to visualize how both past and present data influence the final prediction. [07:49], [08:14]
- **Shared weights are crucial in unrolled RNNs**: Even when unrolled, RNNs share the same weights and biases across all inputs, meaning the number of parameters to train does not increase with more sequential data. [10:39], [10:45]
- **Vanishing/Exploding Gradients hinder RNN training**: As RNNs are unrolled further, gradients can either explode (when weights are > 1) or vanish (when weights are < 1), making it difficult to train the network effectively. [11:24], [13:48]
- **Gradient explosion amplifies early inputs**: If a weight is set to 2, unrolling the network multiple times exponentially amplifies the initial input value, leading to extremely large gradients that disrupt training. [12:51], [13:15]
Topics Covered
- Recurrent Neural Networks handle variable-length sequential data.
- Recurrent Neural Networks use feedback loops for sequential memory.
- Unrolling RNNs visualizes sequential processing and shared weights.
- Shared weights across unrolled RNNs prevent parameter explosion.
- Vanishing/Exploding Gradients hinder RNN training with long sequences.
Full Transcript
Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about
recurrent neural networks, and they're going to be clearly explained!
Lightning and Grid are totally cool. Check them out when you've got some time.
NOTE: This StatQuest assumes that you are already familiar with the main ideas behind
neural networks, backpropagation and the ReLU activation function. If not, check
out the 'Quest. ALSO NOTE: Although basic, or vanilla recurrent neural networks are
awesome, they are usually thought of as a stepping stone to understanding fancier
things like Long Short-Term Memory Networks and Transformers, which we will talk
about in future StatQuests. In other words, every Quest worth taking takes steps,
and this is the first step. So with that said, let's say Hi to StatSquatch. Hi! And StatSquatch says,
Hello! The other day I bought stock in a company called Get Rich Quick, but the next
day their stock price went down and I lost money. Bummer. So, I was thinking, maybe
we could create a neural network to predict stock prices. Wouldn't that be cool?
That sure would be cool 'Squatch, unfortunately the actual stock market is crazy complicated
and we'd probably both get in a lot of trouble if we offered advice on how to make
money with it, but if we go to that mystical place called StatLand things are much
simpler and there are far fewer lawyers. So let's build a neural network that predicts
stock prices in StatLand. However, first let's just talk about stock market data
in general. When we look at stock prices, they tend to change over time. For example,
the price of this stock went up for four days before going down. Also, the longer
a company has been traded on the stock market, the more data we'll have for it. For
example, we have more time points for the company represented by the blue line than
we have for the company represented by the red line. What that means is, if we want
to use a neural network to predict stock prices, then we need a neural network that
works with different amounts of sequential data. In other words, if we want to predict
the stock price for the Blue Line company on day 10, then we might want to use the
data from all nine of the preceding days. In contrast, if we wanted to predict the
stock price for the Red Line company on day 10, then we would only have data for
the preceding five days. So we need the neural network to be flexible in terms of
how much sequential data we use to make a prediction.
This is a big difference compared to the other neural networks we've looked at in
this series. For example, in Neural Networks Clearly Explained, we examined a neural
network that made predictions using one input value, no more and no less. And if
you saw the StatQuest on neural networks with multiple inputs and outputs, you saw
this neural network that made predictions using two input values, no more and no less.
And in the StatQuest on Deep Learning Image Classification, you saw a neural network
that made a prediction using an image that was six pixels by six pixels, no bigger
and no smaller. However, now we need a neural network that can make a prediction
using the nine values we have for the blue company and make a prediction using the
five values we have for the red company. The good news is that one way to deal with
the problem of having different amounts of input values is to use a Recurrent Neural
Network. Just like the other neural networks that we've seen before, recurrent neural
networks have weights, biases, layers and activation functions. The big difference
is that recurrent neural networks also have feedback loops. And, although this neural
network may look like it only takes a single input value, the feedback loop makes
it possible to use sequential input values, like stock market prices collected over time, to make predictions.
To understand how, exactly, this recurrent neural network can make predictions with
sequential input values, let's run some of StatLand's stock market data through it.
In StatLand, if the price of a stock is low for two days in a row, then, more often
than not, the price remains low on the next day. In other words, if yesterday and
today's stock price is low, then tomorrow's price should also be low. In contrast,
if yesterday's price was low and today's price is medium, then tomorrow's price should
be even higher. And when the price decreases from high to medium, then tomorrow's
price will be even lower. Lastly, if the price stays high for two days in a row,
then the price will be high tomorrow. Now that we see the general trends in stock
prices in StatLand, we can talk about how to run yesterday and today's data through
a recurrent neural network to predict tomorrow's price. The first thing we'll do
is scale the prices so that low equals 0, medium equals 0.5, and high equals 1.
Now let's run the values for yesterday and today through this recurrent neural network
and see if it can correctly predict tomorrow's value. Now, because the recurrent
neural network has a feedback loop, we can enter yesterday and today's values into
the input sequentially. We'll start by plugging yesterday's value into the input.
Now we can do the math just like we would for any other neural network. Beep. Boop.
Beep. Boop. Boop. At this point, the output from the activation function, the y axis
coordinate that we will call Y sub 1, can go two places. First Y sub 1 can go towards
the output. And if we go that way and do the math beep, boop, boop then the output
is the predicted value for today. However, we're not interested in the predicted
value for today because we already have the actual value for today. Instead, we want
to use both yesterday and today's value to predict tomorrow's value. So, for now,
we'll ignore this output, and instead, focus on what happens with this feedback loop.
The key to understanding how the feedback loop works is this summation. The summation
allows us to add Y sub 1 times W sub 2, which is based on yesterday's value, to the
value from today times W sub 1. In other words, the feedback loop allows both yesterday
and today's values to influence the prediction.
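The summation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transcript never gives numeric values for the weights and biases, so `W1`, `W2`, `W3`, `B1` and `B2` below are hypothetical values picked for the example. The ReLU activation is assumed from the prerequisites note, the output step is assumed to be a single weight and bias, and the feedback loop is assumed to hold 0 before the first input.

```python
def relu(x):
    """ReLU activation: negative values become 0."""
    return max(0.0, x)

# Hypothetical parameters (NOT stated in the transcript).
W1 = 1.8   # weight on the input
W2 = -0.5  # weight on the feedback loop
W3 = 1.1   # weight from the activation function to the output
B1 = 0.0   # bias added at the summation
B2 = 0.0   # bias added at the output

def step(x, y_prev):
    """One pass through the network: returns (activation, output)."""
    total = x * W1 + y_prev * W2 + B1  # the summation that mixes old and new
    y = relu(total)
    return y, y * W3 + B2

yesterday, today = 0.0, 0.5  # low, then medium

# First pass: yesterday's value; nothing is in the loop yet (assumed 0).
y1, ignored_prediction_for_today = step(yesterday, 0.0)

# Second pass: today's value plus Y sub 1 coming around the feedback loop.
y2, prediction_for_tomorrow = step(today, y1)

print(prediction_for_tomorrow)
```

With these made-up parameters, a low price followed by a medium price gives a prediction close to 1 (high), which matches the trend described for StatLand.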
Hey, this feedback loop has got me all turned around. Is there an easier way to see how this works?
Yes! There's an easier way to see what's going on. Instead of having to remember which
value is in the loop, and which value is in the input, we can unroll the feedback
loop by making a copy of the neural network for each input value. Now, instead of
pointing the feedback loop to the sum in the first copy, we can point it to the sum
in the second copy. By unrolling the recurrent neural network, we end up with a new
network that has two inputs and two outputs. The first input is for yesterday's value,
and if we do the math straight through to the first output like we did earlier, we
get the predicted value for today. However, as we saw earlier, we can ignore this
output. This second input is for today's value and the connection between the first
activation function and the second summation allows both yesterday and today's values
to influence the final output, which gives us the predicted value for tomorrow. Now,
when we put yesterday's value into the first input and we do the math just like before
beep boop beep boop then we follow the connection from the first activation function
to the summation in the second copy of the neural network. Now we put today's value
into the second input and keep doing the math beep boop beep boop beep boop beep
and that gives us the predicted value for tomorrow, zero,
which is consistent with the original observation. In other words, the recurrent neural
network correctly predicted tomorrow's value.
Likewise,
when we run yesterday and today's values for the other scenarios through the recurrent
neural network we predict the correct values for tomorrow.
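The unrolled, two-copy network can be sketched as a single function and checked against all four StatLand scenarios. As before, this is a sketch and not the network from the video: the parameter values are hypothetical (the transcript does not state them), ReLU is assumed as the activation function, and the names `predict_tomorrow` and `scenarios` are made up for this example.

```python
def relu(x):
    return max(0.0, x)

# Hypothetical parameters (NOT stated in the transcript).
W1, W2, W3 = 1.8, -0.5, 1.1
B1, B2 = 0.0, 0.0

def predict_tomorrow(yesterday, today):
    # First copy of the network: only yesterday's value goes in.
    y1 = relu(yesterday * W1 + B1)
    # Second copy: today's value plus y1 arriving over the unrolled loop.
    y2 = relu(today * W1 + y1 * W2 + B1)
    # Only the second copy's output is used; the first output is ignored.
    return y2 * W3 + B2

LOW, MEDIUM, HIGH = 0.0, 0.5, 1.0  # the scaled prices

scenarios = {
    "low, low": (LOW, LOW),         # expected trend: stays low
    "low, medium": (LOW, MEDIUM),   # expected trend: goes even higher
    "high, medium": (HIGH, MEDIUM), # expected trend: goes even lower
    "high, high": (HIGH, HIGH),     # expected trend: stays high
}

predictions = {name: predict_tomorrow(*days) for name, days in scenarios.items()}
for name, value in predictions.items():
    print(f"{name}: {value:.2f}")
```

Note that both copies use the same `W1`, `W2` and `B1`; unrolling only changes how the picture is drawn, not the parameters.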
This recurrent neural network performs great with two days worth of data, but what if we have three days of data?
When we want to use three days of data to make a prediction about tomorrow's price,
like this, then we just keep unrolling the recurrent neural network until we have
an input for each day of data. Then we plug the values into the inputs, always from
the oldest to the newest. In this case, that means we start by plugging in the value
for the day before yesterday, then we plug in yesterday's value, and then we plug
in today's value. And when we do the math, the last output gives us the prediction
for tomorrow. NOTE: Regardless of how many times we unroll a recurrent neural network,
the weights and biases are shared across every input. In other words, even though
this unrolled network has three inputs the weight, W sub 1, is the same for all three
inputs. And the bias, B sub 1, is also the same for all three inputs. Likewise, all
of the other weights and biases are shared. So, no matter how many times we unroll
a recurrent neural network, we never increase the number of weights and biases that we have to train.
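Because the weights and biases are shared, unrolling for any number of days is just a loop over the inputs, oldest first. The sketch below assumes hypothetical parameter values (the transcript does not give them), a ReLU activation, and a feedback value of 0 before the first input; `predict_next` and `PARAMS` are names invented for this example.

```python
def relu(x):
    return max(0.0, x)

# One shared set of hypothetical parameters (NOT stated in the transcript).
PARAMS = {"W1": 1.8, "W2": -0.5, "W3": 1.1, "B1": 0.0, "B2": 0.0}

def predict_next(values, p=PARAMS):
    """Run values (oldest first) through the network; return the last output."""
    y = 0.0  # assumed: nothing is in the feedback loop before the first input
    for x in values:
        # The same W1, W2 and B1 are reused for every input value.
        y = relu(x * p["W1"] + y * p["W2"] + p["B1"])
    return y * p["W3"] + p["B2"]

two_days = predict_next([0.0, 0.5])
three_days = predict_next([0.0, 0.0, 0.5])
nine_days = predict_next([0.0] * 8 + [0.5])

# The number of trainable parameters does not depend on sequence length.
num_parameters = len(PARAMS)
print(two_days, three_days, nine_days, num_parameters)
```

The same five numbers serve a two-day, three-day or nine-day sequence, which is the sense in which unrolling never adds parameters to train.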
Okay, now that we've talked about what makes basic recurrent neural networks so cool,
let's briefly talk about why they are not used very often. One big problem is that
the more we unroll a recurrent neural network, the harder it is to train. This problem
is called The Vanishing / Exploding Gradient Problem, which is also known as the "hey wait, where the gradient go?"
problem. In our example, The Vanishing / Exploding Gradient Problem has to do
with the weight along the squiggle that we copy each time we unroll the network. NOTE:
To make it easier to understand the Vanishing / Exploding Gradient Problem, we're
going to ignore the other weights and biases in this network and just focus on W
sub 2. Also, just to remind you, when we optimize neural networks with backpropagation,
we first find the derivatives, or gradients, for each parameter. We then plug those
gradients into the gradient descent algorithm to find the parameter values that minimize
a loss function, like the sum of the squared residuals. Bam. Now, even though the
Vanishing / Exploding Gradient Problem starts with Vanishing, we're going to start by showing how a gradient can explode.
In our example, the gradient will explode when we set W sub 2 to any value larger than one.
So let's set W sub 2 equal to 2. Now, the first input value, input sub 1, will be
multiplied by 2 on the first squiggle and then multiplied by 2 on the next squiggle
and again on the next squiggle and again on the last squiggle. In other words, since
we unrolled the recurrent neural network four times,
we multiply the input value by W sub 2, which is 2, raised to the number of times
we unrolled, which is 4. And that means the first input value is amplified 16 times
before it gets to the final copy of the network. Now, if we had 50 sequential days
of stock market data, which to be honest, really isn't that much data, then we would
unroll the network 50 times, and 2 raised to the 50th power is a huge number. And this
huge number is why they call this an Exploding Gradient Problem.
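The arithmetic behind the explosion is easy to check. The sketch below follows the simplification used in the explanation above: it ignores every weight and bias except W sub 2 and also ignores the activation function. The starting value `input_1 = 1.0` and the function name `pass_through_squiggles` are assumptions made for illustration.

```python
W2 = 2.0
input_1 = 1.0  # hypothetical first input value

def pass_through_squiggles(value, w2, times_unrolled):
    """Multiply the value by w2 once for each copy of the feedback weight."""
    for _ in range(times_unrolled):
        value *= w2
    return value

after_4 = pass_through_squiggles(input_1, W2, 4)    # 2 ** 4  = 16.0
after_50 = pass_through_squiggles(input_1, W2, 50)  # 2 ** 50 = 1125899906842624.0

print(after_4)
print(after_50)
```

Four copies amplify the first input 16 times; fifty copies amplify it more than a quadrillion times.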
If we tried to train this recurrent neural network with backpropagation, this huge
number would find its way into some of the gradients,
and that would make it hard to take small steps to find the optimal weights and biases.
In other words, in order to find the parameter values that give us the lowest value
for the loss function, we usually want to take relatively small steps. Bam. However,
when the gradient contains a huge number, then we'll end up taking relatively large
steps and instead of finding the optimal parameter we will just bounce around a lot.
Bummer. One way to prevent the Exploding Gradient Problem would be to limit W Sub
2 to values less than 1. However, this results in the Vanishing Gradient Problem. "Hey, wait, where the gradient go?"
To illustrate the Vanishing Gradient Problem, let's set W sub 2 to 0.5. Now, just
like before, we multiply the first input by W sub 2 raised to the number of times
we unroll the network. So if we have 50 sequential input values, that means multiplying
input sub 1 by 0.5 raised to the 50th power and 0.5 raised to the 50th power is a number super close to zero.
Because this number is super close to zero, this is called The Vanishing Gradient Problem.
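The same arithmetic shows how quickly the first input fades when W sub 2 is 0.5. As in the explanation above, this sketch ignores the other weights, the biases and the activation function; `input_1 = 1.0` and the function name `shrink_factor` are assumptions made for illustration.

```python
def shrink_factor(w2, times_unrolled):
    """W sub 2 raised to the number of times the network is unrolled."""
    return w2 ** times_unrolled

W2 = 0.5
input_1 = 1.0  # hypothetical first input value

for n in (4, 10, 50):
    print(n, input_1 * shrink_factor(W2, n))

# After 50 copies, what is left of the first input is on the order of 1e-15.
vanished = input_1 * shrink_factor(W2, 50)
```

After only four copies the first input is already cut to one sixteenth; after fifty, it contributes essentially nothing.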
Now when optimizing a parameter, instead of taking steps that are too large, we end
up taking steps that are too small. And as a result, we end up hitting the maximum
number of steps we are allowed to take before we find the optimal value.
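The effect of step size on gradient descent can be illustrated with a toy problem. This is not the recurrent network itself: it assumes a made-up loss, (param - 3) squared, and simply multiplies its gradient by a factor standing in for W sub 2 raised to the number of times the network was unrolled. The target, starting value, learning rate and step limit are all hypothetical.

```python
def distance_after_descent(gradient_scale, steps=100, learning_rate=0.1):
    """Run gradient descent on (param - 3)^2 with a rescaled gradient.

    Returns how far the parameter is from the optimum after `steps` steps.
    """
    target = 3.0  # hypothetical optimal parameter value
    param = 0.0   # hypothetical starting value
    for _ in range(steps):
        # Derivative of (param - target)^2, rescaled to mimic the extra factor.
        gradient = 2.0 * (param - target) * gradient_scale
        param -= learning_rate * gradient
    return abs(param - target)

well_behaved = distance_after_descent(1.0)   # small, steady steps
exploded = distance_after_descent(2.0 ** 4)  # steps too large: bounces away
vanished = distance_after_descent(0.5 ** 50) # steps too small: barely moves

print(well_behaved, exploded, vanished)
```

With an unscaled gradient the parameter settles at the optimum. With the amplified gradient each step overshoots by more than the last, and with the shrunken gradient the step limit runs out while the parameter is still essentially at its starting value.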
Hey, Josh, these Vanishing / Exploding Gradients are a total bummer. Is there anything we can do about them?
Yes, and we'll talk about a popular solution called Long Short-Term Memory Networks in the next StatQuest.
Now, it's time for some Shameless Self Promotion.
If you want to review statistics and machine learning offline, check out my book,
The StatQuest Illustrated Guide to Machine Learning, at statquest.org. It's over 300 pages of total awesomeness.
Hooray! We've made it to the end of another exciting StatQuest. If you liked this
StatQuest and want to see more, please subscribe. And if you want to support StatQuest,
consider contributing to my Patreon campaign, becoming a channel member, buying one
or two of my original songs or a t-shirt or a hoodie, or just donate. The links are
in the description below. Alright, until next time Quest on!