
What Are Large Reasoning Models (LRMs)? Smarter AI Beyond LLMs

By IBM Technology

Summary

Key Takeaways

  • **LRMs: AI that thinks before it speaks**: Large Reasoning Models (LRMs) go beyond Large Language Models (LLMs) by first planning, weighing options, and double-checking calculations in a sandbox before generating a response, enabling them to reason, plan, and solve problems. [00:12], [00:46]
  • **LLMs vs. LRMs: use cases**: While LLMs are sufficient for tasks like writing social media posts, LRMs are better suited for complex problems such as debugging stack traces or tracing cash flow through multiple shell companies, as their internal chain of thought allows them to test hypotheses and discard dead ends. [01:06], [01:20]
  • **LRM training: reasoning-focused tuning**: LRMs are typically built upon pre-trained LLMs and then undergo specialized reasoning-focused tuning using curated datasets of logic puzzles, multi-step math problems, and coding tasks, each with a full chain-of-thought answer key. [02:06], [02:33]
  • **Reinforcement learning for LRMs**: LRMs learn through reinforcement learning, receiving feedback (thumbs up or down) on each step of their reasoning process from humans or from smaller judging models, to maximize rewards and improve logical coherence. [03:40], [04:14]
  • **LRM inference: balancing thinking time and cost**: The "thinking time," or compute, allocated to an LRM at runtime can be adjusted based on the question's complexity, but each extra pass, self-check, or external tool call increases latency and processing costs. [05:19], [05:31]
  • **LRMs excel at complex reasoning**: LRMs offer advanced reasoning capabilities for tasks requiring multi-step logic and planning, generally need less prompt engineering, and currently lead AI benchmarks thanks to their ability to think through responses rather than just predict the next word. [06:47], [08:04]

Topics Covered

  • How do Large Reasoning Models truly think?
  • How are advanced reasoning models specifically trained?
  • Beyond basic tuning: advanced LRM training methods.
  • How to balance LRM inference time and accuracy?
  • What are the true costs and benefits of LRMs?

Full Transcript

You already know large language models, or LLMs. They predict the next token in a sequence, using statistical pattern matching to crank out human-like text. There are also LRMs, large reasoning models, and they go a bit further: they think before they talk. Now, give an LLM a prompt and it will reflexively predict whatever word statistically fits next. It will output a token, and then another token, and then another token. LRMs still do that too, but first they sketch out a plan. They weigh options and they double-check calculations in a sandbox before building their response. So before they start outputting tokens, they will plan, they will evaluate what comes back, and eventually that will lead to an answer. And

those extra steps can matter. Now, if your question is to write a fun social media post, well then the LLM's reflex is usually fine. But if your question is "debug this gnarly stack trace," or perhaps "trace my cash flow through four different shell companies," well, reflex isn't enough. The LRM's internal chain of thought lets it test hypotheses, discard dead ends, and land on a reasoned answer, rather than just following a statistically likely pattern. Now, of course, this doesn't come for free. It costs inference time and GPU dollars: each extra pass through the network, each self-check, each search branch adds latency and processing time. So LRMs buy you deeper reasoning at the cost of a longer, pricier think. So how do you build one of these thinking

machines? Well, an LRM usually builds upon an existing LLM that has undergone massive pre-training. This is the stage where we teach a model about the world: billions of web pages, books, code repos, and the like. This gives it language skills and a broad knowledge base. Then, after pre-training, an LRM undergoes specialized reasoning-focused tuning: we fine-tune the model specifically to provide reasoning capabilities. The LRM is fed curated datasets of logic puzzles, multi-step math problems, and tricky coding tasks, and each one of these examples comes with a full chain-of-thought answer key. The model learns to show its work. It basically starts with the problem it's been given, and from that problem, its job is to come up with a plan for a solution. Once it's come up with a plan, it needs to execute that plan in multiple steps: step one, step two, and so forth. And then ultimately the model needs to arrive at a solution. It's learning to reason. Then we let the model loose to solve some fresh problems of its own, and that's where it goes through a process of reinforcement learning.
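The problem → plan → steps → solution structure described above can be sketched as a toy chain-of-thought training record. This is a minimal illustration, assuming a simple text format; the field names and serialization are my own, not from any specific dataset:

```python
# Toy chain-of-thought training example: the target text contains the plan
# and the intermediate steps, not just the final answer, so the model
# learns to show its work. Field names here are illustrative assumptions.

def format_cot_target(example: dict) -> str:
    """Serialize a problem plus its chain-of-thought answer key into one training target."""
    lines = [f"Problem: {example['problem']}",
             f"Plan: {example['plan']}"]
    for i, step in enumerate(example["steps"], start=1):
        lines.append(f"Step {i}: {step}")
    lines.append(f"Solution: {example['solution']}")
    return "\n".join(lines)

example = {
    "problem": "A train travels 120 km in 2 hours. How far does it go in 5 hours at the same speed?",
    "plan": "Find the speed, then multiply by the new duration.",
    "steps": ["Speed = 120 km / 2 h = 60 km/h",
              "Distance = 60 km/h * 5 h = 300 km"],
    "solution": "300 km",
}

print(format_cot_target(example))
```

During reasoning-focused tuning, records like this one are what the model is trained to reproduce, step by step, rather than only the final "300 km."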

Now, that uses a reward system where either humans, through reinforcement learning from human feedback (RLHF), give a thumbs up or thumbs down on each one of these steps as they're written, or the feedback can come from smaller judging models called process reward models. Process reward models judge each step of a reasoning chain as good or as bad, and the reasoning model learns via this reinforcement learning to generate sequences of thoughts that maximize these thumbs-up rewards, ultimately improving its logical coherence. Now, there are some other

training methods that can be used as well. For example, we might choose to use something called distillation to train the model further. That's where we have a larger teacher model that's used to generate reasoning traces, and those reasoning steps are then used to train a smaller model, or a newer model, on those traces. So basically, if the advanced teacher model can solve a puzzle by thinking through a solution, that solution path can then be added to the training data of the new LRM. The result of all of this is a model that can plan, that can verify, and that can explain, ready to finally make sense of those shell company cash flows. So

the LRM is trained to think. And now the question is: how much thinking time do you give it at runtime? Well, that's a question all about inference-time compute, or test-time compute, as it's sometimes called. This is what happens every time you ask a question, and different questions can be assigned different amounts of thinking time. So "debug my stack trace" might get a good amount of compute allowance, while "write a fun caption" gets the budget version, where the model just goes through one quick pass. During extended inference time, a model may run multiple chains of thought and then vote on the best one. It might backtrack with a tree search if it hits a dead end, and it might call external tools like a calculator, a database, or a code sandbox for spot checks. And each extra pass through the model comes at a cost: a cost of more compute, and a cost of higher latency while you wait for a response. But hopefully this all also comes with an increase in accuracy.
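The "run multiple chains of thought, then vote on the best one" strategy is often called self-consistency. A minimal sketch, assuming each sampled chain's final answer has already been extracted as a string:

```python
from collections import Counter

def self_consistency_vote(chain_answers: list[str]) -> str:
    """Pick the answer that the most independent reasoning chains agree on."""
    tally = Counter(chain_answers)
    answer, _ = tally.most_common(1)[0]
    return answer

# Three sampled chains agree on "300 km"; one went down a dead end.
answers = ["300 km", "300 km", "150 km", "300 km"]
print(self_consistency_vote(answers))  # the majority answer wins
```

Each extra sampled chain is one more full pass through the model, which is exactly where the added compute and latency costs come from.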

So is that increase in accuracy worth the cost to get it? Well, it depends on the problem you're trying to solve. On the positive side, an LRM offers complex reasoning: LRMs excel at tasks that require multi-step logic, planning, or abstract reasoning. They also offer improved decision making, because LRMs can internally verify and deliberate, which means that answers tend to be a bit more nuanced and hopefully more accurate. And LRMs usually require less in the way of prompt engineering. We don't need to sprinkle magic words into our prompts like "let's think step by step," because the model already does it. That's less prompt hackery. But you might be better off with a regular LLM, or just a smaller model overall, in some situations because, as I've mentioned, there is that higher computational cost. That means more VRAM, more energy, and a higher invoice from your cloud provider. And then there's also the increase in latency.

Slower replies while the model stops to think. Although, I'm endlessly amused by reading those replies, the model's thinking steps as it works through building a response. But that's probably just me. So look, with LRMs, AI models are no longer just spewing language out at you as fast as they can predict the next word in a sentence; they are taking time to think through responses. And today the most intelligent models, the ones scoring highest on AI benchmarks, well, they tend to be the reasoning models: the LRMs.
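The "double-check calculations in a sandbox" idea mentioned in the transcript can be sketched as a spot check that re-evaluates an arithmetic claim before the model commits to it. This is a toy illustration, not any production tool-calling API:

```python
import ast
import operator as op

# Minimal safe arithmetic evaluator: only numeric literals and + - * /
# are allowed, so a claimed calculation can be re-checked without
# running arbitrary code.
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def spot_check(expression: str, claimed: float) -> bool:
    """Return True if the claimed result matches a fresh evaluation."""
    return abs(safe_eval(expression) - claimed) < 1e-9

print(spot_check("60 * 5", 300))   # this reasoning step holds
print(spot_check("120 / 2", 70))   # caught: the claimed value is wrong
```

A reasoning model with access to a check like this can verify each arithmetic step of its chain of thought before moving on, which is one concrete form of the self-verification described above.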
