Reinforcement learning is terrible – Andrej Karpathy
By Dwarkesh Clips
Summary
## Key takeaways
- **Reinforcement learning is terrible**: Reinforcement learning is significantly worse than commonly perceived because it upweights every single token in a trajectory that leads to a correct answer, even if many parts of the path were incorrect. This produces a noisy signal, because it assumes every step taken was correct rather than identifying which specific actions were. [00:11], [01:01]
- **RL sucks supervision through a straw**: Reinforcement learning is inefficient because it takes a long trajectory of actions, potentially minutes of rollout, and extracts only the final reward signal. This single bit of supervision is then broadcast across the entire trajectory, upweighting or downweighting all of it, which is a highly inefficient way to learn. [01:43]
- **Humans learn differently than RL**: Unlike reinforcement learning, humans do not perform hundreds of rollouts. When a human finds a solution, they engage in a complex review process, identifying which parts of their actions were good and which were not, and consciously adjust their approach. [02:03]
- **Imitation learning was miraculous**: Fine-tuning large language models by imitation learning was surprisingly effective. Taking a base model, which is essentially autocomplete, and fine-tuning it on conversational text allowed it to rapidly become conversational while retaining its pre-trained knowledge. [02:29], [02:52]
- **RL can discover novel solutions**: Despite its flaws, reinforcement learning allows models to optimize against reward functions and discover solutions that humans might not conceive of, especially for problems with clear correct answers where expert trajectories aren't necessary. [03:23], [03:34]
- **Future LLM algorithms need updates**: Recent papers, like one from Google, are exploring 'reflect and review' ideas, suggesting a significant update to LLM algorithms is needed. The speaker anticipates several more breakthroughs in this area to improve how LLMs learn. [03:44]
Topics Covered
- Why Reinforcement Learning Is Fundamentally Flawed for LLMs.
- The Critical Need for Human-like Reflection in LLMs.
- How InstructGPT Revolutionized Conversational AI with Imitation Learning.
Full Transcript
Humans don't use reinforcement learning, is maybe what I've said. I think they do something different. So reinforcement learning is a lot worse than I think the average person thinks. Reinforcement learning is terrible. It just so happens that everything we had before is much worse, because previously we were just imitating people, and that has all these issues.

So in reinforcement learning, say you're solving a math problem, because it's very simple: you're given a math problem and you're trying to find the solution. Now, in reinforcement learning, you will try lots of things in parallel first. So you're given a problem, you try hundreds of different attempts, and these attempts can be complex, right? They can be like, oh, let me try this, let me try that, this didn't work, that didn't work, etc. And then maybe you get an answer. And now you check the back of the book and you see, okay, the correct answer is this. And then you can see that, okay, this one, this one, and that one got the correct answer, but these other 97 of them didn't.

So literally what reinforcement learning does is it goes to the ones that worked really well, and every single thing you did along the way, every single token, gets upweighted, as in: do more of this. The problem with that is, I mean, people will say your estimator has high variance, but what I mean is it's just noisy. It basically almost assumes that every single little piece of the solution that arrived at the right answer was the correct thing to do, which is not true. You may have gone down the wrong alleys until you arrived at the right solution. Every single one of those incorrect things you did, as long as you got to the correct solution, will be upweighted as do more of this. It's terrible.
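To make the mechanism being described here concrete, below is a minimal sketch, not from the talk, of outcome-reward weighting: every rollout that ends on the correct answer has every one of its tokens pushed up by the same amount, dead ends included, while the failed attempts are pushed down wholesale.

```python
# Minimal sketch (illustrative, not the talk's code) of outcome-reward weighting.

def token_weights(rollouts, correct_answer):
    """rollouts: list of (tokens, final_answer) pairs sampled for one problem."""
    weights = []
    for tokens, final_answer in rollouts:
        reward = 1.0 if final_answer == correct_answer else -1.0  # one pass/fail signal
        # The same scalar is applied to every token of the trajectory, regardless
        # of whether that individual step actually helped reach the answer.
        weights.append([reward] * len(tokens))
    return weights

# Toy example: one attempt reaches the right answer (every token upweighted,
# including the dead end), the other does not (every token downweighted).
rollouts = [
    (["try substitution", "dead end", "try factoring", "answer: 42"], "42"),
    (["try guessing", "answer: 17"], "17"),
]
print(token_weights(rollouts, "42"))
# [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0]]
```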
>> Yeah, it's noise. You've done all this work, and only at the end do you get a single number of like, oh, you did this correctly. And based on that, you weight that entire trajectory as upweight or downweight. The way I like to put it is: you're sucking supervision through a straw. You've done all this work that could be a minute of rollout, and you're sucking the bits of supervision of the final reward signal through a straw, basically broadcasting that across the entire trajectory and using it to upweight or downweight the trajectory. It's crazy.
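As a rough illustration of the "straw", with assumed numbers rather than figures from the clip: if a minute of rollout produces a few thousand tokens and the only feedback is a single pass/fail outcome, the supervision per token is tiny.

```python
# Back-of-the-envelope sketch of "sucking supervision through a straw".
# The numbers below are assumptions for illustration, not figures from the clip.
tokens_per_rollout = 6_000   # assumed: roughly a minute's worth of generated tokens
reward_bits = 1              # a single correct/incorrect outcome at the very end
bits_per_token = reward_bits / tokens_per_rollout
print(f"{bits_per_token:.6f} bits of supervision per token")  # ~0.000167
# ...and that one bit is broadcast identically across the whole trajectory.
```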
A human would never do this. Number one, a human would never do hundreds of rollouts. Number two, when a person finds a solution, they will have a pretty complicated process of review: okay, I think these parts I did well, these parts I did not do that well, I should probably do this or that. And they think through things. There's nothing in current LLMs that does this, no equivalent of it. But I do see papers popping up that are trying to do this, because it's obvious to everyone in the field.
>> Yeah.
>> So the way I see it: imitation learning, by the way, was actually extremely surprising and miraculous and amazing, that we can fine-tune by imitation on humans. And that was incredible, because in the beginning all we had was base models, and base models are autocomplete. It wasn't obvious to me at the time. I had to learn this, and the paper that blew my mind was InstructGPT, because it pointed out that, hey, you can take the pre-trained model, which is autocomplete, and if you just fine-tune it on text that looks like conversations, the model will very rapidly adapt to become very conversational, and it keeps all the knowledge from pre-training. And this blew my mind, because I didn't understand that it could adjust stylistically so quickly and become an assistant to a user through just a few loops of fine-tuning on that kind of data. It was very miraculous to me that that worked. So incredible. And that was like two or three years of work.
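As a minimal sketch of what that kind of fine-tuning looks like in practice, assuming Hugging Face transformers and PyTorch rather than the actual InstructGPT setup: the objective is the same next-token prediction used in pre-training, only the data is now formatted as conversations.

```python
# Illustrative sketch only (assumes Hugging Face transformers + PyTorch; this is
# not the actual InstructGPT recipe): keep the pre-training objective -- predict
# the next token -- but continue training on text laid out as conversations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # stand-in base ("autocomplete") model for illustration
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A tiny made-up example in an assumed chat-style format.
conversations = [
    "User: What is the capital of France?\nAssistant: The capital of France is Paris.",
]

model.train()
for text in conversations:
    batch = tokenizer(text, return_tensors="pt")
    # Same next-token loss as pre-training; only the data distribution changed,
    # which is what nudges the model into an assistant-like conversational style.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```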
And then came RL. RL allows you to do a bit better than just imitation learning, right? Because you can have these reward functions, and you can hill climb on the reward functions. Some problems just have correct answers; we can hill climb on those without getting expert trajectories to imitate. So that's amazing, and the model can also discover solutions that a human might never come up with.
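To make "hill climbing on problems with correct answers" concrete, here is a small sketch with hypothetical names: the reward only checks the final answer, so no expert trajectory is needed, and any path the model invents counts as long as the answer verifies.

```python
# Sketch of a verifiable reward (function names here are hypothetical, my own):
# for problems with a checkable answer, no expert trajectory is needed -- any
# path the model invents gets rewarded if the final answer verifies.
import re

def extract_final_answer(completion):
    """Pull the last number out of a completion (a toy answer convention)."""
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else None

def reward(completion, correct_answer):
    """1.0 if the final answer matches the known solution, else 0.0."""
    return 1.0 if extract_final_answer(completion) == correct_answer else 0.0

# Only the end state is checked, which is what lets RL hill climb toward
# solutions a human might never have written down as a demonstration.
print(reward("Let x = 7, then 6 * x = 42, so the answer is 42", "42"))  # 1.0
print(reward("I think the answer is 41", "42"))                          # 0.0
```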
>> So this is incredible, and yet it's still stupid.
>> So I think we need more, and I saw a paper from Google yesterday that tried to have this reflect-and-review idea in mind, a memory bank paper or something, I don't know. I've actually seen a few papers along these lines. So I expect there to be some kind of major update to how we do algorithms for LLMs coming in that realm. And then I think we need three or four or five more. Something like that.

If you enjoyed this clip, you can watch the full episode here and subscribe for more clips. Thanks.