Reinforcement learning is terrible – Andrej Karpathy
By Dwarkesh Clips
Summary
## Key takeaways
- **Reinforcement learning is terrible**: Reinforcement learning is significantly worse than commonly perceived because it upweights every single token in a trajectory that leads to a correct answer, even if many parts of the path were incorrect. This produces a noisy signal, because it assumes every step taken was correct rather than identifying which specific actions were. [00:11], [01:01]
- **RL sucks supervision through a straw**: Reinforcement learning is inefficient because it takes a long trajectory of actions, potentially minutes of rollout, and extracts only the final reward signal. This single bit of supervision is then broadcast across the entire trajectory, upweighting or downweighting all of it, which is a highly inefficient way to learn. [01:43]
- **Humans learn differently than RL**: Unlike reinforcement learning, humans do not perform hundreds of rollouts. When a human finds a solution, they engage in a complex review process, identifying which parts of their actions were good and which were not, and consciously adjust their approach. [02:03]
- **Imitation learning was miraculous**: Fine-tuning large language models by imitation learning was surprisingly effective. Taking a base model, which is essentially autocomplete, and fine-tuning it on conversational text allowed it to rapidly become conversational while retaining its pre-trained knowledge. [02:29], [02:52]
- **RL can discover novel solutions**: Despite its flaws, reinforcement learning allows models to optimize against reward functions and discover solutions that humans might not conceive of, especially for problems with clear correct answers where expert trajectories aren't necessary. [03:23], [03:34]
- **Future LLM algorithms need updates**: Recent papers, like one from Google, are exploring 'reflect and review' ideas, suggesting a significant update to LLM algorithms is needed. The speaker anticipates several more breakthroughs in this area to improve how LLMs learn. [03:44]
Topics Covered
- Why Reinforcement Learning Is Fundamentally Flawed for LLMs.
- The Critical Need for Human-like Reflection in LLMs.
- How InstructGPT Revolutionized Conversational AI with Imitation Learning.
Full Transcript
Humans don't use reinforcement learning, is maybe what I've said. I think they do something different. So reinforcement learning is a lot worse than I think the average person thinks. Reinforcement learning is terrible. It just so happens that everything we had before is much worse, because previously we were just imitating people, and that has all these issues.

So in reinforcement learning, say you're solving a math problem, because it's very simple: you're given a math problem and you're trying to find the solution. Now, in reinforcement learning, you will try lots of things in parallel first. So you're given a problem, you try hundreds of different attempts, and these attempts can be complex, right? They can be like, oh, let me try this, let me try that, this didn't work, that didn't work, etc. And then maybe you get an answer. And now you check the back of the book and you see, okay, the correct answer is this. And then you can see that, okay, this one, this one, and that one got the correct answer, but these other 97 of them didn't.

So literally what reinforcement learning does is it goes to the ones that worked really well, and every single thing you did along the way, every single token, gets upweighted, as in: do more of this. The problem with that is, I mean, people will say your estimator has high variance, but what I mean is it's just noisy. It basically almost assumes that every single little piece of the solution that arrived at the right answer was the correct thing to do, which is not true. You may have gone down the wrong alleys until you arrived at the right solution. Every single one of those incorrect things you did, as long as you got to the correct solution, will be upweighted as do more of this. It's terrible.
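To make the mechanism being described here concrete, below is a minimal sketch, not from the talk, of outcome-reward weighting: every rollout that ends on the correct answer has every one of its tokens pushed up by the same amount, dead ends included, while the failed attempts are pushed down wholesale.

```python
# Minimal sketch (illustrative, not the talk's code) of outcome-reward weighting.

def token_weights(rollouts, correct_answer):
    """rollouts: list of (tokens, final_answer) pairs sampled for one problem."""
    weights = []
    for tokens, final_answer in rollouts:
        reward = 1.0 if final_answer == correct_answer else -1.0  # one pass/fail signal
        # The same scalar is applied to every token of the trajectory, regardless
        # of whether that individual step actually helped reach the answer.
        weights.append([reward] * len(tokens))
    return weights

# Toy example: one attempt reaches the right answer (every token upweighted,
# including the dead end), the other does not (every token downweighted).
rollouts = [
    (["try substitution", "dead end", "try factoring", "answer: 42"], "42"),
    (["try guessing", "answer: 17"], "17"),
]
print(token_weights(rollouts, "42"))
# [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0]]
```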
>> Yeah, it's noise. You've done all this work, and only at the end do you get a single number of like, oh, you did this correctly. And based on that, you weight that entire trajectory as upweight or downweight. The way I like to put it is: you're sucking supervision through a straw. You've done all this work that could be a minute of rollout, and you're sucking the bits of supervision of the final reward signal through a straw, basically broadcasting that across the entire trajectory and using it to upweight or downweight the trajectory. It's crazy.
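As a rough illustration of the "straw", with assumed numbers rather than figures from the clip: if a minute of rollout produces a few thousand tokens and the only feedback is a single pass/fail outcome, the supervision per token is tiny.

```python
# Back-of-the-envelope sketch of "sucking supervision through a straw".
# The numbers below are assumptions for illustration, not figures from the clip.
tokens_per_rollout = 6_000   # assumed: roughly a minute's worth of generated tokens
reward_bits = 1              # a single correct/incorrect outcome at the very end
bits_per_token = reward_bits / tokens_per_rollout
print(f"{bits_per_token:.6f} bits of supervision per token")  # ~0.000167
# ...and that one bit is broadcast identically across the whole trajectory.
```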
A human would never do this. Number one, a human would never do hundreds of rollouts. Number two, when a person finds a solution, they will have a pretty complicated process of review: okay, I think these parts I did well, these parts I did not do that well, I should probably do this or that. And they think through things. There's nothing in current LLMs that does this, no equivalent of it. But I do see papers popping up that are trying to do this, because it's obvious to everyone in the field.
>> Yeah.
>> So the way I see it: imitation learning, by the way, was actually extremely surprising and miraculous and amazing, that we can fine-tune by imitation on humans. And that was incredible, because in the beginning all we had was base models, and base models are autocomplete. It wasn't obvious to me at the time. I had to learn this, and the paper that blew my mind was InstructGPT, because it pointed out that, hey, you can take the pre-trained model, which is autocomplete, and if you just fine-tune it on text that looks like conversations, the model will very rapidly adapt to become very conversational, and it keeps all the knowledge from pre-training. And this blew my mind, because I didn't understand that it could adjust stylistically so quickly and become an assistant to a user through just a few loops of fine-tuning on that kind of data. It was very miraculous to me that that worked. So incredible. And that was like two or three years of work.
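As a minimal sketch of what that kind of fine-tuning looks like in practice, assuming Hugging Face transformers and PyTorch rather than the actual InstructGPT setup: the objective is the same next-token prediction used in pre-training, only the data is now formatted as conversations.

```python
# Illustrative sketch only (assumes Hugging Face transformers + PyTorch; this is
# not the actual InstructGPT recipe): keep the pre-training objective -- predict
# the next token -- but continue training on text laid out as conversations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # stand-in base ("autocomplete") model for illustration
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A tiny made-up example in an assumed chat-style format.
conversations = [
    "User: What is the capital of France?\nAssistant: The capital of France is Paris.",
]

model.train()
for text in conversations:
    batch = tokenizer(text, return_tensors="pt")
    # Same next-token loss as pre-training; only the data distribution changed,
    # which is what nudges the model into an assistant-like conversational style.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```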
And then came RL. RL allows you to do a bit better than just imitation learning, right? Because you can have these reward functions, and you can hill climb on the reward functions. Some problems just have correct answers; we can hill climb on those without getting expert trajectories to imitate. So that's amazing, and the model can also discover solutions that a human might never come up with.
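To make "hill climbing on problems with correct answers" concrete, here is a small sketch with hypothetical names: the reward only checks the final answer, so no expert trajectory is needed, and any path the model invents counts as long as the answer verifies.

```python
# Sketch of a verifiable reward (function names here are hypothetical, my own):
# for problems with a checkable answer, no expert trajectory is needed -- any
# path the model invents gets rewarded if the final answer verifies.
import re

def extract_final_answer(completion):
    """Pull the last number out of a completion (a toy answer convention)."""
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else None

def reward(completion, correct_answer):
    """1.0 if the final answer matches the known solution, else 0.0."""
    return 1.0 if extract_final_answer(completion) == correct_answer else 0.0

# Only the end state is checked, which is what lets RL hill climb toward
# solutions a human might never have written down as a demonstration.
print(reward("Let x = 7, then 6 * x = 42, so the answer is 42", "42"))  # 1.0
print(reward("I think the answer is 41", "42"))                          # 0.0
```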
>> So this is incredible, and yet it's still stupid.
>> So I think we need more, and I saw a paper from Google yesterday that tried to have this reflect-and-review idea in mind, a memory bank paper or something, I don't know. I've actually seen a few papers along these lines. So I expect there to be some kind of major update to how we do algorithms for LLMs coming in that realm. And then I think we need three or four or five more. Something like that.

If you enjoyed this clip, you can watch the full episode here and subscribe for more clips. Thanks.