RLM Theory Overview feat. Alex L. Zhang | long context + REPL + sub-agents
By Deep Learning with Yacine
Summary
Topics Covered
- LLMs Deteriorate Beyond 250k Context
- RLM Offloads Context to REPL Environment
- RLM Excels on Quadratic Complexity Tasks
- Coding Ability Unlocks Long Context Mastery
- Post-Train RLMs on Natural Distributions
Full Transcript
One of the core problems of large language models that I've been literally yelling about for two years is their high-quality long-context bottleneck. It is truly difficult to train an LM to handle large context without any loss of performance. Anything larger than a 250k context window will start to show signs of deterioration in intelligence. Even the current frontier LMs only have an effective high-quality context of around 40,000 tokens for moderately hard problems. This is why I got unreasonably excited about the release of the Recursive Language Model by Alex L. Zhang earlier in 2026. I truly think this paradigm is the one we'll see more of in all our favorite agentic systems this year, and it will yield truly magical improvements in intelligence without much added cost in training. In a
nutshell, the RLM is a technique that lives in the family of harness methods, where we put the LM inside some sort of environment to do a task. In our case, the main goal of the environment is to let the LM process a large amount of context without retraining the model and without loss of performance. For the RLM, this environment or harness is a read-eval-print loop, or REPL, which can best be visualized as a kind of Jupyter notebook. The core idea is that instead of loading all of the context into the model directly, you load it into the environment, and then you give the LM of your choice, it doesn't matter which one, tools to mine the information in that pile of context. The magic of the RLM is that on top of that environment, which lets it effectively have high-quality context in the millions-of-tokens range, it also allows the LM to spawn sub-agents to do semantic subtasks in the REPL environment. Hence the name recursive language model. What
we're going to cover today is the full theory of how the RLM works, using the information from the academic paper and the blog post from Alex. We'll also chat with the first author directly about the core intuition behind the method and what's next in this paradigm. I hope you'll enjoy it. Also, after this video, if you want to go more into the practical aspects of RLM and get hands-on with the material, I highly recommend the video that my friend Neural Abb did on the subject. It's an awesome channel that goes in depth on very technical deep learning topics. You should check it out. Let's get into it.
So the paper we're going to review is the RLM paper. You might have seen it float around a bit. I'd say it's a great one; I really liked how it was written and the general flow of it. Basically, it's a harness paper: it shows a new way of plugging an LM into a long-context type of harness so that it can ingest more context without the context-rot problem. So overall it's great. It's a lot of pages, but what matters most is the first eight pages, so it's not a terribly intimidating paper to read. What's cool is that they added a lot of information about what went right and what went wrong: all the qualitative information about what's actually up with the RLM as it ingests context in different situations. So it's a pretty nice one, and we're going to check it out in its entirety.
This paper actually comes from an idea that was postulated in October. He has been actively working on it and refining the idea, and now we have a paper; it's essentially the blog post, just pushed a bit further. So if you also want to understand a bit more about how ideas in research are formed, take a look at the blog post, because it's much rawer in its intuition: what went through his head while he was thinking about long context, and what he followed in order to do the actual bigger experiments. You can see what scaled from one to the other. So it's pretty cool. I suggest you read both: one for the intuition, one for the rigor of how it's written, and it's very approachable. This is a long-context paper, right? It deals with having models ingest much more context than they theoretically can normally, or they can ingest it, but at the edge of the context window the context starts deteriorating a bit. You've most likely already encountered this when chatting with ChatGPT or whatever: as the discussion goes longer and longer, it just gets dumber and dumber. So the usual intuition is that you summarize the information, or you take two chat windows and try to merge them into a bigger one, distilling out some of the information.
It's kind of that idea, but coupled with a fact about LMs: they're very, very good at coding. That's the core intuition here, and what they did in order to ingest contexts even bigger than the window allows is really interesting. They put the prompt, which is absolutely massive in some cases, like 10 million tokens, inside an environment variable in a REPL setup. So the gigantic prompt is there, and the model doesn't hold it; it just digs into it again and again through this type of workflow: print the first 100 lines of the prompt, or whatever, then continue like that, then slice it up a bit in a Jupyter-notebooky way, the way you'd have a dataset and play with it to understand it. It's the same idea. What's interesting is that in this case it can actually delegate some of the tasks to another agent. That agent will have a prompt and will be able to interact with the same thing in a different way, but with subtasks: "in chapter one, find all the items listed that belong to whatever," and then that LM will do it and return the response, and it goes back and forth like this. It doesn't do deep recursion to depth two or three or four; it does depth one and comes back to the main root node, and at some point it needs to give a final answer. This is all happening in a REPL environment. The information gets stored in variables that are created along the way, like part one, part two, or whatever it is, and it will just use these variables: okay, this is what's in this variable, so I can use it, I can create for loops, I can create a whole bunch of stuff. It's really free-form coding happening here, and at the end it outputs a response. That's it: this big thing is the RLM.
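To make that concrete, here is a minimal sketch of the idea in Python: the huge prompt lives as a variable in a REPL namespace, the model emits code, and only what that code prints ever enters the model's window. Everything here (`prompt`, `run_model_code`) is illustrative, not the paper's actual API.

```python
import contextlib
import io

huge_prompt = "line 1\n" * 1_000_000  # stand-in for a ~10M-token context

# The REPL namespace that model-written code executes in.
namespace = {"prompt": huge_prompt}

def run_model_code(code: str) -> str:
    """Execute model-emitted code in the REPL and capture what it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)
    return buf.getvalue()

# A typical first probe: print the first 100 lines, not the whole thing.
out = run_model_code("print('\\n'.join(prompt.splitlines()[:100]))")
print(len(out.splitlines()))  # prints 100
```

The point is that the model's context only ever contains the short strings returned by `run_model_code`, never the million-line `prompt` itself.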
It's the LLM with the environment attached to it. Now, if you look at the performance, it's actually pretty interesting. There are three benchmarks shown: needle-in-a-haystack, OOLONG, and a harder version of OOLONG. The first is linear complexity; the last is quadratic complexity in terms of the length of the prompt, so it's much harder even though it's the same amount of input. GPT-5 is good at needle-in-a-haystack because, I don't know, it's not that hard: yes, it's a lot of context, but it's not complicated; you can do one scan through it and you're done. The other ones require doing a lot of computation, and you have to do it in one pass, which is hard. When you take the same model, not post-trained or anything, and you put it into the RLM harness, what you get is: needle-in-a-haystack is still good, but it's even good outside of the context window, which is super interesting; it's past the theoretical limit of what the model can ingest. If you've seen my stuff over the past year, I've been talking a lot about quality long context, because I really think that if you can crack the quality long-context problem, a lot of more complicated things immediately become available to the models, and this seems to be a way to do it almost for free: there's a cost, but it's not crazy. This whole region that was impossible before is now possible. On top of that, the curves are straightening out: OOLONG is getting better, and the OOLONG pairs variant, which is quadratic in complexity per input length, is also doing great. At 1 million tokens we're still good, and I think they pushed it to 10 million, which is insane, and it's still working fine. And if you look at the average API cost in the figure, it's not terribly more costly. You see the RLM GPT-5 on OOLONG pairs: it scales up because the task is quadratic in complexity, so it has to do more processing, but it's possible in a regime that is just impossible for the other ones. The RLM on OOLONG is still doing super fine, while GPT-5 alone struggles; sorry, GPT-5 alone is cheaper, but for lesser performance. So it is more costly in general to use this stuff. Things get a bit murkier at smaller contexts, because there you technically don't even need sub-agent calls; the LM can just answer straight up. But this is just to show that yes, it's more costly, but compared to the other methods we're going to see, it's not terribly costly. So it seems to be a paradigm that makes a lot of sense.
From what I understood from the paper, you basically attach more LMs on top of the main one and use them kind of as storage, but it's a bit more complicated than that. The main storage is really the environment running on the main node, but then you can spawn an agent to just do something. You can call it an agent, but it's literally just another RLM, and that RLM can do stuff: it has a REPL and can do some analysis and work in there. So that's what's going on: you can spawn them at multiple points and then aggregate their information to various degrees. In some cases they could choose not to spawn the sub-agent, and in some instances it's actually better not to. But in other cases, if you have to do some semantic-search type of stuff, or semantic understanding, you're much better off delegating it to an LM, which might just take the information, ingest it, and do something with it. And if you look here, you see it splits the prompt on chapter 2, part one, part two, and feeds the context of part one into the sub-agent. So the sub-agent has a tighter slice of what it's going to look at. It's more like sub-processing of the information than storage. But then after that, you take all the information and put it together. So that's the general idea.
together. Um yeah, so this is a general idea. If you look at the long benchmark
idea. If you look at the long benchmark uh that they're using, right? Um for
those that like are maybe not familiar with long uh benchmark, there's not terribly a lot of them, right? And um
it's not just about like having massive uh task. It's also about like having
uh task. It's also about like having tasks that are a bit more complicated, right? If you have that, if you have
right? If you have that, if you have like a bit more complicated task, um then like um what's it like you will actually be able to understand a bit better um how the models are handling it
because if you look at it here like all of these task have the same amount of input context, right? This one is much easier. It's kind of like constant,
easier. It's kind of like constant, right? It's it's not like it's never
right? It's it's not like it's never harder uh that much. But this is why you see like LM4 lama 4 scal being able to do it
to 10 million like because it's a easy task but as soon as you have like bit more complicated um um task it's a bit more difficult and we have even more
difficult than them in the in the benchmark uh use we have the browse comp plus right um so browse comp is a benchmark for browsing agent and then
what they're doing is that um they're actually augmenting it right so they're augmenting the brump uh like a corpus here with information that is a gold
right that is being retrieved by O3 over there right and that is human verified and then uh information that is kind of like noise right which is information
that makes sense in some way but that is not the information that you need right so you have these r negative and then you have this this gold kind of information all piled up into like the
br um uh comp uh benchmark and then this is what they're using in this specific case and you can go up to like uh here they're using 1K document uh the benchmark provide a verified offline
compress of 100k document that is guaranteed content gold evidence and our negative document for each task and then you have like a whole bunch of of task but we are uh using 150 randomly sample
task as our evolation set and then we provide 100 randomly chosen document to the model. So 100 doc is given and then
the model. So 100 doc is given and then there's a bunch of tasks that need to to go there and once you give the 100 document they guarantee that there's a gold evidence in there uh to so that
they they can actually do the stuff. So
this is a bit more difficult than this one, right? So there's some gradation of
one, right? So there's some gradation of difficulty here. Then we have oolong,
difficulty here. Then we have oolong, right? And ulong is there's a whole
right? And ulong is there's a whole bunch of stuff document and then you have to piece the information together on all the document. It's not like you cannot it's not just you going to find
it in one doc. It's going to be like you have to piece information um together in multiple multiple stuff here. same thing
with let's say like this transcript of like 12 hour of dialogue in this whatever game right um so that's Olong right and then they decided to kind of
make it even harder with um uh so they split uh they manually modified the three course uh split of olong to include 20 20 new queries that
specifically require aggregating pairs of chunk to construct the final answer.
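Aggregating over pairs of chunks is what makes this quadratic: answering requires comparing every pair, so work grows as O(n²) in the number of users or chunks, while a needle-in-a-haystack lookup stays linear. A toy illustration (the data and labels are made up):

```python
from itertools import combinations

# Toy per-user label sets extracted from chunks (illustrative only).
users = {"u1": {"NUM"}, "u2": {"LOC"}, "u3": {"NUM", "LOC"}}

# "List all pairs of user IDs where both users have a NUM-labeled sentence":
# every pair must be examined, hence the quadratic blow-up with input size.
pairs = [
    (a, b)
    for a, b in combinations(sorted(users), 2)
    if "NUM" in users[a] and "NUM" in users[b]
]
print(pairs)  # [('u1', 'u3')]
```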
So a task will be like this: "In the above data, list all pairs of user IDs where both users have at least one sentence with a numeric-value or location label. Each one of the questions can be labeled as one of the labels: human, description, abstract concept, entity, blah blah blah." So you have to take pairs of information, find them in this mesh of information, and then output something useful. A bit harder; this one is supposed to be quadratic in difficulty. And the last one is LongBench v2. LongBench is a whole bunch of long documents: they get annotated, the annotations get reviewed, some revisions are made, and there's a manual review at the end. So it's massive, mega-big documents, and you have single-document QA and multi-document QA. What's interesting for us is the code-repository-understanding slice: 50 QA questions on literal code repositories, where you have to dig into the code in order to understand it and answer the questions. That's the slice they're taking.
It's a bit weird, but it's actually quite difficult to make good long-context benchmarks, even from in-the-wild settings like multi-turn use, where there's a big corpus you need to dig into. Surprisingly, there were not that many benchmarks, at least there are more now, already structured in a way that lets you test and train models on them, and that's maybe also why some models are not that great on long context: they were never trained with long enough context at all. The method is like this: the REPL environment holds the prompt, the prompt is loaded as a variable, and you can play with it. So the context is loaded in a variable and the model has access to it; that's the situation we have here. It can spawn these agents, and the prompt looks like this: "You are tasked with answering a query with its associated context. You can access, transform, and analyze this context interactively in a REPL environment that can recursively query sub-LMs" (this is the part they remove in the ablation) "which you are strongly encouraged to use as much as possible. You will be queried iteratively until you provide the final answer."
And then there's a whole bunch more. They use roughly the same system prompt for both GPT-5 and Qwen3 Coder, but for Qwen they have to add this little thing: "Be very careful about using llm_query as it incurs high runtime cost; always batch as much information as reasonably possible into each call." That's because Qwen was pinging sub-agents for every line or something like that, so they had to restrain it a bit. GPT-5 doesn't do this out of the box; it just uses the calls properly. Qwen just doesn't care at all, which I found kind of funny. So this is the system prompt; you can actually take a look.
It's in the paper, and it's the only thing guiding the LM. So technically the RLM is just this; there's no RL being done on it whatsoever. It's literally just a good system prompt with a REPL environment. That's kind of it.
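The whole setup can be sketched as an outer loop, which is roughly what that system prompt implies: the model is queried iteratively, each turn it emits code, the REPL output is fed back, until it signals a final answer. `fake_model` and the `CODE:`/`FINAL:` convention are invented for illustration, not the paper's actual protocol.

```python
import contextlib
import io

def fake_model(history: list[str]) -> str:
    """Toy model: probes the context once, then answers."""
    if len(history) == 1:
        return "CODE: print(len(prompt))"
    return "FINAL: the prompt is long"

def rlm_loop(prompt: str) -> str:
    namespace = {"prompt": prompt}
    history = ["system: answer the query using the REPL"]
    while True:
        reply = fake_model(history)
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:").strip()
        # Execute the model's code action and feed the output back.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(reply.removeprefix("CODE:").strip(), namespace)
        history.append("repl: " + buf.getvalue())

print(rlm_loop("x" * 1000))  # prints: the prompt is long
```

Swap `fake_model` for a real LM API call and this is the shape of the harness: no training, just a loop, a namespace, and a prompt.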
So this is good. We can then take a look a bit more over here: there's a bunch of patterns that emerge in the models. They're not being told explicitly what to do, but they end up doing a whole bunch of stuff. The type of thing they do: they can probe the context and then interact with the probe, with regex or semantic sub-agent calls. They can defer some of the reasoning over large context by creating recursive LM calls: they'll write a function and then recursively ask a bunch of questions, everything in code as well, building the prompt and shoving it in there. And they can stitch recursive LM outputs together to form longer composite outputs, which is pretty interesting.
I really like the way it's set up. It's very natural; it feels a bit like the model is discovering what it can and cannot do. Among the patterns they see is filtering input using code execution based on model priors: the LM has the ability to filter the input context without explicitly seeing it. It doesn't know what's in the prompt, but it filters whatever it needs to filter based on its prior ideas about what should be in there. It doesn't need to load a lot in order to filter and get what it needs. Stuff like running a regex over information that should technically be there, or, as you see here, "find all items listed that belong to people": it's doing a lot of inferring about what should be in this stuff, then actually going over there and doing it, or having another model do it. They're doing a lot of chunking, like we saw: splitting into chapters and then asking a sub-agent to go and fetch information in a chapter. And they are doing a lot of verification.
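The "filter without seeing it" pattern looks roughly like this: the model writes a regex based on its prior about what the data should contain, so only the matches, never the whole context, enter its window. The context and field names here are made up for illustration:

```python
import re

# A giant context the model never prints in full (synthetic example).
context = "\n".join(
    ["noise line"] * 50_000
    + ["user_id: 1042 intent: NUM", "user_id: 2077 intent: LOC"]
    + ["noise line"] * 50_000
)

# The model guesses a structure ("user_id: ... intent: ...") and filters on
# it; only the tiny match list ever reaches its context window.
matches = re.findall(r"user_id: (\d+) intent: (\w+)", context)
print(matches)  # [('1042', 'NUM'), ('2077', 'LOC')]
```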
This part is the part I find funny: sometimes they just panic and then they verify, verify, verify. There's a failure case that is super funny, in one of the trajectories I think, where the model verifies forever, gets the right answer, then discards everything it did and writes down the wrong answer. So it can verify its information through sub-LM calls. And it can pass recursive LM output into variables for long-output tasks. That's a way of safeguarding its context: if you look here, this output might be gigantic, you don't know, maybe it's 20k tokens, but for the model it's literally just one variable. It uses these variables and stitches them together; it doesn't have to see their contents. You can just trust that it's good, or maybe verify a bit, or you can just look at the length of the variable and decide whether it's a good idea to open it up and load it into context. So this variable encapsulation is a very interesting element of how it safeguards its context: it's always operating in a better region of the context window than raw LM calls would. As for the baselines: we have the RLM with the REPL, and the RLM with no sub-calls. In the no-sub-call ablation, the recursive part is just gone; you don't spawn anything, you do everything in the root, and already in some cases that alone is better. They also benchmark against a summary agent, which is another methodology you can use: you literally have summarization happening. There are different ways of doing it, but it's always that whatever happened in the past gets summarized into a smaller format, then you add the next piece of context and keep going, and once it gets too big you compact it, compress it, and keep going, a bit like what's happening with Claude and such, the compaction type of methodology. The issue is that if the information is very sensitive to its exact format, it can get lost after too many compactions; it's lossy long context in this specific case. They do it like this: in an iterative fashion, the agent is given input until its context is full, at which point it's asked to summarize all relevant information and continue, and they use a smaller agent for the summarization.
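A toy sketch of that summary-agent baseline, to show where the lossiness comes from; the window budget and the summarizer here are made up:

```python
WINDOW = 100  # toy context budget, in characters

def summarize(text: str) -> str:
    """Toy lossy summarizer: keeps only the first 30 characters."""
    return text[:30]

def run_summary_agent(chunks: list[str]) -> str:
    history = ""
    compactions = 0
    for chunk in chunks:
        if len(history) + len(chunk) > WINDOW:
            history = summarize(history)  # lossy step: detail is gone forever
            compactions += 1
        history += chunk
    return f"{compactions} compactions, {len(history)} chars kept"

print(run_summary_agent(["x" * 40] * 5))  # 3 compactions, 70 chars kept
```

Every compaction throws information away; if the answer depends on the exact format of something summarized three compactions ago, it's unrecoverable, which is exactly the failure mode the RLM's variable storage avoids.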
The last baseline is CodeAct, which is like the ReAct type of framework. I always forget exactly what ReAct stands for: you think, take an action in the environment, and get a response, that sort of back and forth. In this case you think, and the action is code: you shove the code into the environment, you get a response, and you can do that multiple times. So it's similar to what's happening with the RLM, but within the ReAct framework for agents: the LM uses code as the action instead of doing a bunch of tool calls with JSON.
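To see the difference between a JSON tool call and a code action, here's a toy contrast; the tool name and the document are invented:

```python
import json

# Tool-call style: one rigid, predefined operation per call.
tool_call = json.loads('{"tool": "count_lines", "args": {"upto": 100}}')

# CodeAct style: the "action" is arbitrary code, so slicing and filtering
# compose freely without needing a predefined tool for the combination.
doc = "\n".join(f"line {i}" for i in range(200))
action = "result = sum(1 for l in doc.splitlines()[:100] if '9' in l)"
namespace = {"doc": doc}
exec(action, namespace)
print(tool_call["tool"], namespace["result"])
```

The code action does in one step what would otherwise need a slicing tool plus a filtering tool plus a counting tool, which is the argument for code as the action space.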
So that's that. The main result is what we saw: it's working, which is fantastic to see, and the pricing is not too crazy. It is highly variable, though; some calls are massively more expensive than others. The 50th-to-75th-percentile cost is pretty similar across methods, but the prices start to explode with GPT-5 at the 95th percentile: you see CodeAct and the summary agent getting massive multipliers, while the RLM is only roughly doubled, which is still not crazy high; it's much cheaper than those, and it has fewer biases about what it can be doing. Qwen, same thing, but Qwen seems to be pricier, and remember, Qwen is the one that has a tendency to just go ahead and not care too much about compute cost.
If we look in more detail at the exact numbers: we have these benchmarks, CodeQA, BrowseComp-Plus, OOLONG, and OOLONG pairs, with the task lengths listed; BrowseComp-Plus can go up to something like 11 million tokens, which is kind of wild. The base model is not able to do this at all; this part is just impossible for it, the context is too big to ingest, so you need a harness to do anything. If you put the CodeAct harness or the summary harness on it, it gets better: you get 12% here, 30% and 8% there. It does better on CodeQA generally, and the OOLONG pairs task is quadratic, so it's a bit more difficult, but it does not too badly on OOLONG. Interestingly, the RLM with no sub-calls, with Qwen, is actually doing better on these two benchmarks, and it's still doing really fine with no sub-calls at all; that's just one LM in a harness with the REPL, more similar to CodeAct than anything else. Funnily enough, when you give it the sub-agents, the performance drops a bit for this one, but it increases for these ones, and over there it's fairly similar. OOLONG and OOLONG pairs are more semantically oriented, while CodeQA is more exact-match type of stuff, so it kind of makes sense that no sub-calls does better there. This kind of means that some models have more trouble deciding when to use what, for which type of case. Anyway, GPT-5 is generally much stronger than Qwen, except with no sub-calls here; the performance is better all across with the sub-calls, so it seems to be better at deciding when to use them. That's kind of it.
stuff. That's kind of it. And I think they did an experiment where they scale number of document right uh so here you have 100 you have 10 uh thousand uh document in context and uh in this
experiment in the appendix you see like the look what happened in the degradation right because remember the comp plus the models are not able to do it like these ones the base model are
just not able to do that one uh but like if you just increase the number of document you can see that degradation say of GP5 it is just nose dive straight up Right. With 10 documents, it's good.
up Right. With 10 documents, it's good.
With uh 50 bar really and 100 is just like almost inexistent, right? Well, the
um uh the RLM is kind of doing okay all across and even better when it has all the information over here for some reason. I'm not sure what um and without
reason. I'm not sure what um and without the crazy cost. Here you see the React is kind of uh much higher at the,000. Uh
but the RLM is still relatively um uh cheap. So all of them are able to scale
cheap. So all of them are able to scale well without performance degradation and the inference cost is scaling not too bad. There's a bunch of trajectory that
There's a bunch of trajectories they put in the paper which are pretty cool. You have this one with GPT-5, which is a happy path: you have a thousand documents over there, and you can literally see what the model is doing in code, because that's how it interacts with the environment. So it's searching for specific keywords, looking at very specific steps. You see here the window, and it's trying to find snippets that make sense. It created the keywords it's looking for and is running the regex query, and you have the response. And I think this one is making sub-LM calls to find the answer; this is what the root node is asking one of the sub-agents: extract the following from this article, what festival in town is this about, and what year was this specific celebration held, and so on. The sub-agent responds with all of this, and based on that the root will check the information. Then it's able to figure out that this is the response, it's Maria Delmasio, and this is what it will output at the end: print winner first, last, and first, last is what it has. So it's cool: none of this is hardcoded or tooled. They just go and do Python.
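To make that concrete, here is a hedged sketch of the kind of Python a root model might write inside the REPL: grep-style keyword probing over the loaded context, plus a stub standing in for the harness's sub-LM call. All names here (`context`, `find_snippets`, `llm_query`) are illustrative assumptions, not the paper's actual API.

```python
import re

# Toy stand-in for the huge document dump the harness loads into the
# REPL as a variable (hypothetical; a real run would hold thousands
# of documents here).
context = (
    "Doc 1: The town festival of San Gennaro was held in 1998. "
    "Doc 2: Unrelated text about harvest schedules. "
    "Doc 3: The 1998 celebration drew record crowds."
)

def find_snippets(keyword: str, window: int = 40) -> list:
    """Return a small window of text around each keyword hit,
    mimicking the grep-style probing seen in the trajectory."""
    hits = []
    for m in re.finditer(re.escape(keyword), context):
        start = max(0, m.start() - window)
        hits.append(context[start:m.end() + window])
    return hits

def llm_query(prompt: str) -> str:
    """Placeholder for the harness's sub-LM call: a real RLM would
    send the prompt (question + snippet) to a fresh model instance."""
    return "[sub-LM answer for: " + prompt[:40] + "...]"

snippets = find_snippets("festival")
answer = llm_query(
    "Extract the following from this article: what festival is this "
    "about, and what year was it held? " + snippets[0]
)
```

The point of the sketch is only the shape of the interaction: cheap code probes first, and a sub-LM call only on the snippet that survives the filter.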
Over here is the funny one. We're going to take a look at this last one. So this one cost a dollar. This is the question, and it's on the OOLONG pairs benchmark, so it's kind of difficult: you have to double-check and verify a whole bunch of stuff. It says the model begins by probing the context with various code snippets. Then it decides to check semantically: it classifies the data using a sub-agent call. Then it processes this in batches, and you see it created a function that does the LM query and is continuously calling it. The recursive LM calls do their work, and then the root LM looks at whether an instance satisfies the query, in this case whether it has the pair and the information. And what it does here is continuously verify its answer. This is Qwen3 Coder. It will repeat this process and attempt to regenerate the answer, and it does that five times, over and over, returning the same answer. And when it has to actually give a final answer, because it's starting to run out of context, it's just the root LM generating an answer out of nowhere and spitting it out, which is going to be the wrong answer.
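The verify-loop behavior described above is easy to sketch. Below is a hypothetical, simplified version (not the paper's code) of a root LM that re-derives an answer and only commits once it sees the same result twice, with the forced-answer fallback when the retry budget runs out; `sub_lm` is a deterministic stub standing in for a real recursive LM call.

```python
def sub_lm(question: str, chunk: str) -> str:
    """Stub for a recursive LM call. A real call would be
    nondeterministic; this one is fixed so the example reproduces."""
    return "pair-42" if "pair" in chunk else "no-match"

def answer_with_verification(question: str, chunk: str,
                             max_attempts: int = 5) -> str:
    """Re-ask until the same answer comes back twice in a row,
    mimicking the neurotic double-checking in the trajectory."""
    seen = None
    for _ in range(max_attempts):
        candidate = sub_lm(question, chunk)
        if candidate == seen:   # same answer twice in a row:
            return candidate    # trust it and commit
        seen = candidate        # otherwise, try again
    # Out of budget (or context): forced to emit the last guess,
    # which is where the wrong-final-answer failure mode shows up.
    return seen

result = answer_with_verification("which pair satisfies the query?",
                                  "batch with pair data")
```

With a deterministic stub the loop converges on the second attempt; with a real, noisy sub-LM it can burn all five attempts and fall through to the unverified guess, which is exactly the failure described.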
So, I found this part absolutely fascinating; it tells you a lot. Qwen3 Coder will do that, while GPT-5 seems to do this kind of neurotic double-checking a bit less. But just the fact that it doesn't trust the stuff being produced by, technically, itself, by the sub-agents, and keeps verifying again and again, tells you a lot about the benchmark and its difficulty, but also about what the LMs are actually doing when they get to difficult stuff like that. So there's maybe stuff that can happen to steer the model toward being more sure about what it is looking at. But it really looks like a student who has a thousand documents open and keeps double-checking and double-checking and just doesn't trust itself. Yeah, this one was funny. There are other ones, but you can read the paper, it's all there, and it gives you a sense of what's going on. And I really like that about the paper: the quantitative information is fine, it's all good, but I really like to see the exact qualitative vibe of how these things run. By the way, you can also try to run it on your own; it's literally a harness that you put around the model. And the last part I really liked is the negative results. Honestly, I think all papers should have that.
You found something cool, perfect; just tell us what didn't work so we don't try it again. So: using the exact same system prompt across all models can be problematic. This is the Qwen3 thing: they had to make this change, otherwise Qwen3 was just spinning like crazy. And this part is interesting: models without sufficient coding capability struggle as RLMs. The corollary also seems to be true: models that are good at coding seem to be good at long context, because they can manipulate the REPL environment efficiently. Thinking models without sufficient output tokens struggle as RLMs; that's also an interesting one. So they tested a whole bunch. You can also check out this other blog post from Prime Intellect, "Recursive Language Models: the paradigm of 2026" by Sebastian Mame; he tried a whole bunch of other LMs too, so you can get a good overview of what's going on.
RLMs without asynchronous LM calls are slow, right? In this case the calls are blocking, so you're doing stuff step by step.
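For intuition, here is a toy illustration, under my own assumptions (thread-based concurrency, and a `sleep` standing in for a real network-bound sub-LM call), of why blocking sub-calls are slow and how overlapping them helps.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sub_lm(doc: str) -> str:
    """Stand-in for a sub-LM call; the sleep mimics request latency."""
    time.sleep(0.05)
    return doc.upper()

docs = ["doc%d" % i for i in range(8)]

# Blocking version: each sub-call waits for the previous one,
# so total time is roughly 8 latencies.
t0 = time.perf_counter()
serial = [sub_lm(d) for d in docs]
serial_time = time.perf_counter() - t0

# Concurrent version: the sub-calls overlap, so the root LM's turn
# finishes in roughly one latency instead of eight.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(sub_lm, docs))
parallel_time = time.perf_counter() - t0

assert serial == parallel  # same answers, very different wall clock
```

The interesting design problem, as noted next, is what the root model should do while some of its sub-calls are still in flight.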
Maybe doing it asynchronously will help, but then you get into the weird setup of: what happens, how do you continue, if stuff isn't done yet, with another sub-agent? Which is an interesting problem, I think. And: depending on the model, distinguishing between the final answer and a thought is brittle for RLMs, which is also super interesting, and maybe tells us a lot about these failure cases, because maybe it couldn't recognize that it has the final answer for real this time, so it should just go and commit, but it still thinks it's maybe a thought and you just need to get the final answer again.
I don't know, it's really grad-student coded, this whole interaction, which is a bit fun. There are a bunch of limitations here that you can read as well. Oh, but we're actually going to ask Alex a whole bunch of questions. Thank you for coming and answering questions for all of us. I think it's really cool; I really like the paper. It also reminds me of the TRM paper, actually: not simple in the sense that it's nothing impressive, but simple in the sense that it was easy to follow.
>> So, we like that. To start off, can you tell us a bit about yourself, your background, and your research interests, for everybody here?
>> 100%. Yeah. And I'll preface by saying I hope the paper was easy to follow. Actually, in some of the earlier iterations of the paper I had wanted to write in some theory-related motivation, a whole thing about why certain problems are harder than others, which I will actually talk about today, but I scrapped it from the paper, mainly because I don't think we had strong enough evidence to support that what I was claiming was actually true. But I think intuitively you can see a lot of it. Anyways, for context, I'm a first-year PhD student at MIT. I think it's very funny when people say I'm an MIT researcher; I've only been there for like three months. I graduated from my undergrad at Princeton back in 2024.
I originally wanted to do math, funnily enough; I guess it's what the school is known for. But I did a bunch of really random stuff. I did some exploration with a friend into blockchain schemes, and all this random stuff. I did a bunch of more core RL work earlier in my undergrad, and then near the end I started working with the SWE-bench team on SWE-bench-related stuff, mainly SWE-bench Multimodal, which came out last year. And then I got really involved in GPU MODE, which is also really random: I was really interested in GPU programming, and I currently help host all the competitions, which is not at all related to RLMs. But the point I'm trying to make is that a lot of my current research interests and ideas are motivated by lots of random things I've done in the past. And the field moves super fast, so there isn't really one thing to be fixated or focused on. Currently, RLMs are my main research focus, and it's not necessarily long-context problems, which I'll also elaborate on as we go. But I think in general, for this year, there's a lot of really interesting research we can do for language models that isn't necessarily systems work or infra work. But yeah.
>> Cool. Yeah, it's really cool. And I see that you've done a whole bunch of benchmarking.
>> Yes.
>> Can you tell us a bit about what motivated that work?
>> So I will say, I'm not going to lie and say I love making benchmarks. I don't think making benchmarks is fun. Everyone wants to do the flashy thing: you want to train a 100B model and do all this cool stuff. For context, the benchmarks he's referring to: SWE-bench Multimodal was the first one, with the SWE-bench team; then KernelBench, which is LLM-generated GPU kernels and evaluating those; and the most recent one is this thing called VideoGameBench, which is trying to evaluate vision-language models that can play an assortment of games. These are games from the '90s, like Doom, Mario, Kirby, these kinds of things. But you actually talked about it earlier: this RLM work kind of highlights that we don't have a lot of good benchmarks. And the reason I ended up working on benchmarks every single time was because I wanted to work on a particular problem, but there was just no eval for it.
>> And I think this is actually a really big problem in the field in general. I have strong opinions on what evals should look like. We kind of have a problem where current evals are not a great indicator of how good a model actually is, even on the task you're evaluating. For example, there are a lot of evals for math and coding, but they're not even great evals for whether or not a model is good at math or coding. So this is something that needs to change, and I don't think a lot of people like working on evals; I'm going to be honest, I don't think it's fun, but I think it's really important work. And unfortunately, I might have to come up with evals for RLMs and future works like this, but we'll see.
>> Yeah.
>> Yeah. I think generally long context is worse, because you really have to think hard, and then there are a lot of documents: what are you going to do, verify each one? No, man. So you have to really build it and then do it properly. And actually, the people that have a lot of long-context data relevant for LMs are the closed labs, and why would they publish their internal data, right? So that's a bit of the problem. I realized that when I was reading the MiniMax paper, MiniMax-Text-01, because they were talking about this long-context translation task where there's a dead language only about a hundred people speak, and there's this book you give in context, and then you check if the model can do the translation. On paper it makes a lot of sense. But then what happens is that at some point even the newer models are getting good at the language without the book, because they're seeing the book at some point in training. So it becomes even harder to come up with good long-context evals, and you don't really know if the model is actually able to process the longer input. So when you actually give it your stuff, it just sucks, and you're like, why does it suck? It doesn't have the stuff in context; it has never seen that big of an input, and that's the end of the story. There's nothing you can do with it. Which is actually what interested me a lot with this paper: instead of trying to bring all this information into context and come up with massive amounts of training data for long context, the models can have a small context, and that's fine, but they get the tools they need to mine the information, a bit like a PhD or graduate student. They don't have everything in context; they have the data sitting there, they have their Jupyter notebook, and they just go and learn about it, and at the end of a notebook session they get some insight out of it, maybe it's wrong, but they get some insight out of it and they can move with it.
>> Yeah. Also, funny story: in the paper there are only four benchmarks, and even the code QA one, I think you mentioned it, is part of a larger benchmark, LongBench v2, which arguably is maybe the most difficult current long-context benchmark. We actually evaluated on a lot more benchmarks than are shown in the paper. The problem we ended up finding with a lot of these benchmarks was, either (a), the RLM can just solve it, basically, but we don't report this because the way it solves it is by using the code environment, without even needing the sub-LM calls, which is kind of a silly thing. One example: another task in that benchmark is computing a long arithmetic sequence, but if you plug this into Python, of course you can just do it; it's not a hard problem. But GPT-5 can't do it, it can't get everything correct. So the scores you'd report are, oh, the RLM gets 100% but GPT-5 gets 0%, and that's not really an interesting result, because, well, of course. And the other problem is (b), the base model can just do it; it can actually just solve the task. One of the issues we ran into that was really silly: we had examples where the task would be about a book or something, and you take the book away, I tested this, you just remove the book, and the model can still solve the task. It doesn't need to read the book because it already knows what's in there. So it's really, really hard to find good evals. And for anyone that's interested in doing research right now, this is an open problem; it's genuinely low-hanging fruit: you just need to find a good eval that's realistic, and that people actually want to see models solve.
Yeah, I think that last part is important: that you actually want the model to be able to solve it. Toy problems are cool, but when you're going to actually bring the model to whatever you're building, already knowing it can solve a related task is already something.
>> Okay, I think we got a bit through the high-level overview. Can you talk to me a bit about the intuition?
>> Yeah.
>> That led to this, because if you look at the related work, you could have gone all sorts of different ways, but you decided on a REPL environment. What triggered this?
>> Yeah. So this is actually a really important question. I don't want to say that I'm a genius and this is a completely new idea that no one's ever thought of. I have seen online a lot of "doesn't Claude Code do this," "doesn't XYZ method do this," "doesn't OpenAI already do this in Codex," and to some extent, yes. The way this idea came about was basically that nowadays models are really, really good. And I want to preface with this, because I think it's a very timely idea: if you tried to do this maybe last year, and we've tried this with DeepSeek-R1 for example, it's actually not very good at doing this whole thing. I think a lot of the fundamental model architecture and training research was really important to lead up to this point. The best way to think about it is that methods like Claude Code and Codex do this very smart codebase-management thing where, instead of feeding the codebase to a model, they use special tools. I'd actually say this started with SWE-agent and OpenHands last year. But this idea of not feeding all of the information directly to the model, and using specialized tools to navigate that information, was very exclusive to code, or software-engineering tasks. And the intuition came from the fact that a programmer is not going to read the whole codebase at once. The core intuition behind this idea is: well, you can actually do this for any task. And actually, Claude Code and all these scaffolds are highly specific to code: you can use them for other tasks, but the models themselves are post-trained specifically to solve coding tasks.
But I think the nuance here is: yes, the RLM idea is really good, it can do long-context things, and this is really important, you can slot in an existing model and show that it works; for the purposes of the paper, this was the most important experiment we wanted to show. But the more subtle thing, and why I'm excited beyond this initial paper, is the implication that you can actually start post-training models to do this kind of paradigm. And this is a lot cheaper than trying to extend the context window of a model, or build a larger model. We're starting to enter an era where these models are actually really good; these transformer-based neural networks are really, really powerful. And, I want to be careful with my words here, because we should continue to improve models, but it's exponentially more expensive to even double the context window of a model. The point I'm trying to make is that the models are already so powerful at transforming text of a certain context-window size into text of a certain context-window size, and you can chain these things together and produce a significantly more interesting system without incurring really, really expensive scaling costs. So in my mind, this is another axis of scale, which is very interesting. Now, we don't talk about this a lot in the paper, because as a paper we can't claim things we can't prove, but I think this is a really important point. And I think this is actually why, for example, Prime Intellect is really interested in this approach: it's not necessarily just the long-context part. There's a piece of it that's really, really important, which is that maybe all future models actually interact with their context in this way. So yeah.
>> Yeah. For me, what sparked my attention here is that it has a similar shape to the thinking and chain-of-thought kind of trick, right? Because it's simple enough, it works across the board, but then you can actually train for it so that it's even better. We haven't done this yet, but I think it will have the same characteristics as this setup. And the other thing: yes, coding is economically super important for the whole field, because if it starts to work even better on bigger codebases with just a simple framework everybody can use, then there will be more tools able to work on bigger codebases, and then you get into enterprise and the money is there; that I understand. But in my view it unlocks the whole scientific-research-agent type of stuff, agents able to go for a longer amount of time and keep digging into things, because fundamentally this is a grad student. That's the feeling I had: hey, I have this big, complicated, super-long task, there's a lot of stuff I need to look at, so I'm going to parse through it a bit, figure out this is relevant, this is less so, and give myself a task for next week: try to figure out this chunk. Then you go figure out that chunk and get it out, and you work on it. And this flow, which happens almost literally in a Jupyter notebook, leads to some amount of facts or information being found that can then be used to answer a bigger question. I really liked that it has this setup. The other thing I really like, and I want to check with you whether this was on purpose, is that it's very minimalistic: there's no massive RL, no bells and whistles. It's a system prompt and a REPL. Was this done on purpose, or did you try a whole bunch of stuff before getting to that?
>> Yeah, so this was intentional, but we also tried a lot of stuff before that. What we ended up settling on: I don't love calling this a scaffold, because, and I want to be clear, it totally is a scaffold, but "scaffold" makes it sound like a new type of agent we're building that we want you to use. More fundamentally, what this is is a very particular way to do model inference. And keeping it as minimalistic as possible is very important for this, because you can't afford to train a model just to be used as an agent, unless you're Anthropic and your goal is to sell Claude Code. If we're more interested in general model capabilities, we want something as thin as possible on top of the model. And that was the guiding principle for how we designed this.
>> That makes a whole bunch of sense. Wait, there's a question here.
>> Yeah.
>> "Just wondering, are there problems where the context expands instead of contracting at the meta level, so that it can't fit in any of the LM contexts?" There was a similar question earlier about what happens if the context window of one of the intermediate models fills up. Yeah, I mean, this is definitely a concern. In the long term, the hope, one of the core ideas of an RLM, is that no single model call should ever exceed a certain length. That's the hope right now; how you actually guarantee this is not super easy. What we found in our experiments is that if we just let this thing run as is, it actually never fills up; it never even gets close to filling up. But you can imagine that if you go to harder and harder tasks, maybe it does fill up. Ultimately, what this should look like in its full form is a recursive language model spawning another recursive language model, and in this sense the actual intermediate model calls never exceed a certain length. You could also implement tricks like compaction, like Claude Code currently does. I don't love this, because it kind of takes away from the core idea, which is that this entire process should be a, quote unquote, no-information-loss process. What I mean by that, and this is the reason we store everything in the REPL, is that the model should technically, in theory, have access to all the information in its purest form, not in a compressed way. And you kind of want to maintain that throughout the trajectory of the model as well. But, sorry, the short answer is: we haven't run into this issue in the current experiments we've run, but it's totally a plausible issue. And I think it does get solved by deeper and deeper recursion, because the idea is you keep splitting up the context. We don't have a strong, robust guarantee here, but maybe that's a future-work kind of thing as well.
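One way to picture that depth-based idea, as a rough sketch under my own assumptions rather than the paper's implementation: if the context won't fit a fixed per-call budget, split it and recurse, so that no individual call, including the final merge, ever sees more than the budget.

```python
def lm_call(prompt: str, budget: int = 1000) -> str:
    """Stub for a real model call; the assert encodes the invariant
    being discussed: no single call exceeds the budget."""
    assert len(prompt) <= budget
    return "answer(%d)" % len(prompt)

def rlm(question: str, context: str, budget: int = 1000) -> str:
    """Recursively split an oversized context; each leaf and each
    merge is an individually bounded model call."""
    if len(question) + len(context) <= budget:
        return lm_call(question + context, budget)
    mid = len(context) // 2
    left = rlm(question, context[:mid], budget)
    right = rlm(question, context[mid:], budget)
    # The merge prompt only contains the two short sub-answers,
    # so it is itself far below the budget.
    return lm_call(question + " merge: " + left + " | " + right, budget)

out = rlm("Q? ", "x" * 10_000)
```

Of course, a real RLM decides how to split with code rather than naively bisecting, and a real merge can lose information, which is exactly why the robust-guarantee question above is still open.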
>> And in my view, compaction is kind of interesting in the sense that if you're working toward a specific goal, then compaction kind of makes sense, right? But I come back to this grad-student type of workflow. Imagine the grad student was cleaning the data in a specific way and then throwing away the raw data: they'd be kicked out of the lab ASAP, right? You do this and then you messed up, I don't know, the filtering or whatever it is; well, congratulations, what the heck are we going to do now? And I always come back to the scientific-discovery type of shape, because I think this has high potential in that realm: you have the thing in its raw form and you're digging through it. I think what Pataki is hinting at is that there are now experiments, for scientific discovery, where the model runs for days; at some point it's creating a lot of context, because, say, the REPL is literally a hundred meters long in terms of back and forth. At some theoretical point it should fill up the whole context, even if the output is just one answer, a name or whatever. But having the raw data, and having the thing dig through it, is already a big piece of the puzzle of being able to generate insights and facts out of it. I have a literally dumb proposition here, which amounts to jamming my SQL database in there and storing a bunch of facts. I mean, you already have a REPL in there; why not add a MySQL instance and store a bunch of stuff? Well, we're going to get to this. Before we dive into that, I just want your raw thoughts about long context in the model. I've done a video on this and I've been digging through it. It's literally a hard problem, because you always get into these kinds of trade-offs, right? To make it good, let's say you linearize the attention, or whatever the heck you do, block stuff, lightning attention, all sorts of weird, wacky things, and then you optimize GPUs to make it actually faster. Do you think there's still something there to juice, or is it just the wrong way of thinking about the long-context problem?
>> Yeah. So I think it is the wrong way to think about it, and I'll explain why. Scaling the context window of a language model has two main challenges. The first is the systems-level challenge. Attention is usually not even the issue, but say attention is quadratic: maybe you can use linear attention, sliding windows, things like that, and maybe you need more GPUs to train a larger model, maybe you're 10x-ing the cost of your training run. This is definitely a challenge, but my take is that if it were the only challenge, we would be able to keep improving the models significantly; we could extend the context window much further than what we currently have. I believe that, and I think anyone in the systems community would likely agree. Maybe there's some really strange reason why you can't, but in its current form, scaling compute and scaling model size is purely a cost issue. I don't think we've hit that wall yet.
I think the more subtle issue is the data, and this is a core reason why I think RLMs are so cool. We often take for granted that the way we've trained language models is effectively on the internet, on naturally occurring language, learning that distribution. But the naturally occurring language distribution is not unbounded in length. The sequences we observe in the wild tend to be distributed according to some mean length and variance, right? We have gotten away with it because language models keep improving off these naturally occurring sequences that capture the distribution we want. The way we have done longer and longer context is by generating synthetically long sequences and training on those. The problem is that it's not fully convincing to me that doing this nets you any long-term benefit.
The clearest example of this is the practical failure of reasoning. Reasoning models are really good, don't get me wrong; they were a great breakthrough of the past year. But a lot of papers have come out recently showing experimentally that reasoning is this really silly thing, because the actual content of the reasoning trace is almost irrelevant to the final answer. Part of the reason this happens is that, at these scales, as sequences get longer you need exponentially more data to fit a proper distribution; there's an entropy argument here. What we've observed with these long reasoning chains is that the good part of a long chain is that it conditions your model well to produce the right output; you can think of it as a way to pick out the correct outputs you actually want. Reasoning has been seen as a way to do this, but then, as we saw with the Qwen experiments and the RLVR stuff, you can post-train on essentially random reasoning and still get good answers. This is a really odd thing.
But the beauty of the RLM is that we can keep the language model's input and output distributions within a length that is actually naturally occurring. It's a weird thing to wrap your head around, but I think this is the context-rot phenomenon we've been observing: you make the sequence really, really long and all of a sudden model performance just tanks, and it's like, why? I don't love thinking of models in an anthropomorphized way; a human obviously would not make these kinds of mistakes, but a human also learns in a very different way. Yes, these models are very impressive and they generalize well, but at the end of the day we have to think in a mathematically principled way about how they were trained and what they are doing. If you think that way, it's very obvious that just taking the transformer and training it on huge context sequences is a really difficult thing to do. With RLMs, the idea is that we don't have to: we can do long-context things without training the models in a long-context way. And there are a lot of benefits to doing it this way. But yeah, that's my take.
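The point above — do long-context work without any single long-context call — can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm, and `llm_query` is a hypothetical stub standing in for a real model call:

```python
# Sketch of the RLM idea: no single model call ever sees more than
# CHUNK characters, even though the total input is arbitrarily long.
CHUNK = 8_000  # stay well inside the model's "natural" sequence length

def llm_query(prompt: str) -> str:
    # Stub: a real implementation would call a language model here.
    return f"summary({len(prompt)} chars)"

def rlm_answer(question: str, context: str) -> str:
    if len(context) <= CHUNK:
        return llm_query(f"{question}\n\n{context}")
    # Recurse: each sub-call sees one bounded slice; the root call
    # only ever sees the concatenated short sub-answers.
    parts = [context[i:i + CHUNK] for i in range(0, len(context), CHUNK)]
    sub_answers = [rlm_answer(question, p) for p in parts]
    return rlm_answer(question, "\n".join(sub_answers))

print(rlm_answer("Who is mentioned?", "x" * 100_000))
```

Every call stays within the "naturally occurring" length distribution Alex describes, which is the whole trick.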
>> Yeah, no, I agree here. And my other view of long context, if you go play around in the internals of attention or how everything is set up: the issue is that, theoretically, a method can be better, faster.
>> Yes.
>> But when you look at it in practice, Flash Attention v-whatever is much better on every metric because it's optimized for the GPUs. So you get these theoretically fantastic advancements that are absolutely dinky and worthless in practice, and nobody is going to do the work of building the GPU kernels you'd need to use them efficiently. So then it's like, why did we even do this? And if there's a method that can sidestep this, for sure we're not going to push further into long context; the slice of what the models see is enough. Also, I see reasoning in two ways. The first is that it just helps the model filter out wrong kinds of answers and bias it toward where the answer should be; technically you could just inference-scale a first shot and the right answer would maybe be somewhere in that mess, and if you could pluck it out you could do it either way. But if you look at it from an activations perspective, you get these traces of activations happening in the model. Because the model doesn't have state, one activation, combined with another input, gives you the next activation, and at some point it gets the right shape to produce the right activation for the task. So reasoning is nice, but the fact that it lives outside the model is a bit weird, and as it goes it's just using the useful context inside. Wait, there are a bunch of questions. Somebody says: you mentioned Claude Code, and other agents already do agentic context retrieval and management to some extent, and the exciting part is perhaps more the post-training. So this is a question about whether the interesting part of the RLM is the RL we can do on it. What's your take on that?
>> Yeah. I want to be careful here, because as a new PhD student I'm still getting used to what I can claim and what I cannot claim. I think the interesting part of the RLM paper is still the main result. One of the things we wanted to solidify in this paper, which I wanted to do in math but is maybe best said in plain English, is that long-context tasks are not equal. I don't know why this wasn't made clear in the past, but obviously needle-in-a-haystack is a very easy thing to do, whereas if you're given a really dense long context, it's really hard to process. And the main contribution is that even with no training, you can apply this really simple, task-agnostic method to current models and scale their performance on really long sequences, and it can process really dense and also really sparse inputs very well. We have the results, and on its own this is already a really cool result; even if there were no RL, no future-looking part, this is a cool paper and I'm very happy to publish something like it. The RL part, though, the training part, is more about why I think this paper stays interesting beyond that, for this year: why my research is still focused on RLMs even after this paper, which we'll probably try to publish somewhere. That's where all the other speculation comes in, and a lot of it is grounded in intuition. A lot of people are seeing that this is a really interesting bet to make, similar to chain-of-thought and some of the other things that have worked in the past. But yeah.
>> Yeah. This paper also reminds me of, I think, the Meta paper where they jammed a coding environment into the model and it got better, and stuff. I just want to show this. Not to dunk on anybody working at Meta, but when I saw that Llama 4 Scout needle-in-a-haystack block going out to 10 million tokens, I knew it was absolutely worthless. It's not just a needle-in-a-haystack problem; it's much more complicated than that. I think you put it well, and I'd like to see that theory pushed a bit further: this angle between the size of the context and the difficulty of the task. In the general discourse this is not well explained. And the other axis is, I don't know, the average useful window size of the model. If we have these three axes, then it's a bit easier to say, roughly speaking, how hard this specific task will be for this specific model to interact with. Okay, cool. We can maybe dive a bit more into the RLM structure, because there are a whole bunch of questions about it. My first question: you chose a REPL for this, right? Which I think makes a lot of sense. But you did an ablation on the sub-agents, where you removed the sub-agents. Do you think you could do the ablation the other way around, where there is no REPL and it's just a whole bunch of sub-agents working on the context without it being fully loaded? I think this part was maybe missing, in the sense that you don't have to load the full thing and they just go and do their stuff. It's still an environment variable somewhere, but they're not writing code; they're just working on it.
>> Yeah.
>> Did you think about this, or is it useless when you look at it?
>> No, this is something that we missed, and we're actually running it right now. There are two new things I'm adding to the paper, which I'm not going to make a big announcement about; it's more because we're submitting it to a conference. One of them is exactly that: we need a baseline that is effectively, can you take ReAct or CodeAct or something and give it sub-agents, but without the offloading-into-a-REPL part. And the point of showing this, for RLMs — another thing I want to be clear on is that the idea of taking a model and giving it access to sub-models is not new. We're not the first people to do it; a few other works have tried this, and obviously Claude Code intrinsically does it. There's an argument to be made that the sub-agent way of doing it will get phased out. What I mean is that the way they do sub-agents today is that you define the sub-agent and then Claude Code is smart about using it. I think in the long run this will be completely removed and Claude Code will decide what sub-agents it wants to use.
But even ignoring that, the thing I want to be clear about is that there are two key parts of the RLM. One of them, obviously, is the recursion. The second is how you actually do the recursion. This is a non-obvious thing, and the REPL is one way to do it. There have been some other proposed ways I've seen online, like using a file system and bash commands, which are also great. The reason we chose the REPL, as was mentioned earlier, is that these models are pretty good at coding. Maybe Claude, maybe Opus 4.5, can also do file-system management really well, and that's great; we should definitely try it, and it's one of the things we want to implement in the open-source library if people want to use it. But the REPL, I think, is the most intuitive way: Python is really easy to read, and it's really easy to say something in English and write it out in Python. So yes, this baseline is very important, and we're currently running it. The results are probably what you'd expect, but I would say the main thing is that this setup cannot handle long context, for obvious reasons: it still has to ingest the full prompt. But yeah.
>> Yeah.
>> Okay, that makes a lot of sense. We have a bunch of questions about sub-agents. My first one, and it goes both ways: does the model know how many sub-agents it's spawning, and does the sub-agent know that it's a sub-agent, or is it just running a task?
>> Yeah. So in the current setup, no; we actually provide as little information as possible. The reason being, if it can work without this information, that's great, and people can experiment with tuning it if they think it'll work better. The model implicitly knows how many sub-agents it's spawning, because the code it generates should tell it how many. But the Qwen 3 experiments clearly show that it maybe doesn't have the greatest grasp of how many, especially if it writes a for loop over something long.
>> Yeah, I think that's it. And I also think it might get confused because it's writing a for loop and encapsulating the LM query inside a function call.
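The pattern being described — the model emitting a loop that spawns sub-agents through a query helper — might look like this inside the REPL. This is a hypothetical sketch; `llm_query` is stubbed here, standing in for whatever helper the harness actually injects, and the harness-side counter is just one way to track spawns the model itself may lose count of:

```python
# The kind of code an RLM root model might emit inside its REPL:
# a loop over chunks, each iteration spawning one sub-agent call.
calls = 0

def llm_query(prompt: str) -> str:
    global calls
    calls += 1  # harness-side counter: tracks spawns even if the model loses count
    return f"answer to chunk of {len(prompt)} chars"

chunks = ["chunk-a" * 100, "chunk-b" * 100, "chunk-c" * 100]
results = [llm_query(c) for c in chunks]
print(f"spawned {calls} sub-agents")  # → spawned 3 sub-agents
```

The spawn count is only implicit in the loop bound, which is exactly why a model can lose track of how many sub-agents it just created.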
>> Now you have multiple layers of abstraction over what the heck you're doing. Poor dude is confused out of his mind.
>> Yeah. So a lot of these things can get baked in, and a lot of them can also get post-trained out, to be honest. As for whether the sub-agent knows that it's a sub-agent: I actually think it shouldn't know. The reason we did it this way is that a big part of the thesis here is that you can use an RLM on a single model. Yes, you can use an RLM to spawn other models — you can use GPT-5 to spawn Gemini 3, that's fine, and that's likely what it will look like for maybe the next few months. But ultimately what we really want is a single model that acts as both a regular model — it should still be a regular model — and an RLM. And when it spawns itself as a sub-agent, it should treat that like a regular call; it shouldn't need the prior that it's a sub-agent. It's just being asked a question and it has to answer it. So I think a key thing moving forward is how you train an RLM such that it still maintains its performance as a regular model but also has the ability to be an RLM. That's an interesting problem.
>> Yeah. I'm saying this because, for Qwen 3, in order for it to work, you had to literally tell it in the system prompt: my guy, watch out for the compute cost, because this is too much, right? But if it knew how deep it currently is in sub-agent calls, it would get that information without having to tweak the system prompt. You'd get one clean system prompt that just works everywhere, while implicitly giving the model information: the number of agents it's spawning is correlated with the input cost. If it's 50 agents deep —
>> Yeah, at that point it should know that it's messing up.
>> And I was also reading their blog, and they gave it hints on the difficulty. I think this is another part that's super important, because these models seem to be pretty poor at assessing, roughly speaking, how difficult a task is and how much compute they should spend to solve it. That's what prompted this thought. The other thought is that if we want to do recursive sub-agent calls, then in my view the axes you were talking about — how long the input is and how hard the task is — should maybe be something the model knows about. Like: look how much I've spent so far, and look at what I'm giving you; you're, I don't know, agent number 52, three layers deep, but you have an easy task. This task is supposed to be easy.
>> Yeah. I can just solve it; this is supposed to be easy for my breed of LLMs, right?
>> The part that is hard here is implicitly knowing how hard the task is. But if you're able to give it that, it should direct the LM naturally. You see that GPT-5 seems to reason better about compute cost and how hard things are; Qwen 3 has absolutely no idea about that stuff.
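The idea floated here — making depth, agent number, and an estimated difficulty explicit in each sub-agent's prompt instead of hand-tweaking one system prompt per model — could be sketched like this. Everything below (the `CallMeta` fields, the bracketed header format) is a hypothetical illustration, not anything from the RLM paper:

```python
from dataclasses import dataclass

@dataclass
class CallMeta:
    depth: int        # how many recursion layers deep this call is
    agent_id: int     # e.g. "you are agent number 52"
    difficulty: str   # rough estimate: "easy" | "medium" | "hard"

def sub_agent_prompt(task: str, meta: CallMeta) -> str:
    # Prefix the task with its position in the call tree and a
    # difficulty hint, so the sub-agent can budget its compute.
    return (
        f"[depth={meta.depth} agent={meta.agent_id} "
        f"difficulty={meta.difficulty}]\n{task}"
    )

print(sub_agent_prompt("Extract all names.", CallMeta(3, 52, "easy")))
```

One clean prompt template then works at every layer; the metadata carries the per-call information.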
>> Yeah, this is a great point. Honestly, this is something that should be experimented with; what you're saying could very well be the way it's done in the end.
>> And we have a question here that fits with the one I wanted to ask: in the paper you said the system prompt is fixed across all experiments, so the sub-agent doesn't even know the system prompt, right? It doesn't know that it's in an RLM-type setup. Okay, cool.
>> Because we're doing depth equals one, the sub-agents are just plain models.
>> You just fix one specific task, and that's it.
>> And it doesn't have access to its ancestor tree; it doesn't know anything about that stuff. Okay.
>> In this case, I think it's because the tree is not interesting; it's just a root and then a bunch of leaves. If you start thinking about higher recursion depths, then yes, maybe we should start thinking about telling it where on the tree it is, giving it a little more context about its parent node.
>> I think so, and I go back to the lab analogy, right? Somebody gets handed a project — hey, can you do this — and the grad student says okay, takes it, looks at it, and hands it to another one — hey dude, can you do this — passing along the actual thing as-is, without the recipient knowing it's the sixth person in the chain. That's one thing. The other thing is that if it knows the task is easy — and it is arguably easy — the chances that it will actually go and do it are a bit higher. I've pulled up a bunch of neuro-inspired research on this: knowing the difficulty of a test does change human behavior. If you know it ahead of time, the chances you'll do great on the exam are really high. If you only find out when you have to do it, with no time to prepare, it's different; it depends on whether you're the anxious type or not. In this case, I think Qwen 3 is anxious and GPT is pretty chill. But it has an organic, human type of impact, and I also found some papers showing it has an impact on LLMs' abilities too. They're just not that great at assessing the complexity of the task.
>> This is interesting, yeah. The last thing I'll say, for the more theoretically inclined people, is that this is actually a really interesting problem of local and global observations. I don't know how related this is to POMDPs in RL, but in general, what we're dealing with here is a system where not every model, not every actor, has all the information about what's going on. Which is important, because I think the thesis here is that it can't. But there is maybe something to be said about how much information you should give each of the models at every layer; there is likely a way to characterize this very well. But anyway, not that important.
>> No, I think it's actually super duper important, especially if we're thinking about asynchronous versus synchronous. In the asynchronous case, I think it doesn't matter too much, because an agent just goes, does its stuff, and that's it.
>> Yes.
>> But in the synchronous case, I think there's a chunk missing, which is where we store all of the context, or the facts, or whatever it is we're directly mining right now.
>> Mhm.
>> And then that gets used to double-check facts, or in some shape or form align the rest of the model's behavior. Okay, I had a question about the hardness question; I think we already touched on it. What's your raw intuition about why Qwen 3 Coder makes so many sub-agent calls? I've said a bunch of stuff already, but roughly speaking — because it's still big, a 400-something-B model. What's your take?
>> I don't have a fully principled way to answer this, but we have seen that some of these models, Qwen 3 especially, are heavily benchmark-maxed post-trained models. And as much as we like to make fun of OpenAI and all these companies, I think ChatGPT, or GPT-5, and Claude, and Gemini tend to be pretty good even at newer tasks they haven't really seen before; they tend to make more principled decisions. I think Qwen 3 Coder just isn't explicitly trained to do this kind of thing, and so it makes very poor decisions. That's my speculation, I don't know. It could also have to do with the tasks it was trained on in the past; maybe it's used to just spamming. I don't know.
>> Right, right, right. I think this is pretty important, because if they're kind of fried up with RL and we need to RL them some more, this may need to happen a bit earlier in the post-training of the model, if we need to RL the model on the RLM. Also, just rough intuition here: why do you think all the models keep repeatedly verifying their information? Because this is something else — spamming sub-agents is one thing, but then you have the answers and you're verifying them again and again.
>> Yeah.
>> Right? Why do you think this is?
>> Yeah. So, based on my experience using even coding IDEs, I think trajectories are a very unnatural form of text. A trajectory is a concatenated sequence of inputs and outputs from a model, and these haven't really existed until recently; it wasn't a natural thing you would find on the internet, for example. And one of the frustrating things — I still don't fully know a way around it other than maybe post-training — is that if the model comes up with the answer really quickly and the trajectory is really small, it tends to just finish right there: I'm done, there's nothing here. And again, I don't want to anthropomorphize this; the argument I'm making is not that as the sequence gets longer the model becomes more uncertain or something. I think what it really boils down to is that when the sequence is really long, the models make suboptimal decisions; they're just not very good in this setting. We've seen this in the past with the jokes we've made about Cursor: when you have a really long history, it starts making really odd decisions. I think this is a similar thing: for whatever reason, the high-probability action becomes retrying what it just did and verifying that it's correct, and it gets stuck in this loop. Qwen 3 is the biggest offender — this is a known issue with Qwen 3, it tends to repeat things — but it actually happens with GPT-5 as well. And I think this goes back to the earlier issue about training on long-context things: even at a smaller context window, maybe 100K or 50K, it still makes these really silly decisions. So yeah, I would say it's probably a training issue.
>> Yeah.
In my mind, it might also have to do with the fact that these models are stateless, and, like you said, nothing moves them out of the distribution of "I'm going to have to retry this again." For us, who have state, what we see is: you dumb piece of trash, it's been four times already, it's enough, it's most likely good. For the model it's still "no, this is still uncertain," when the fact that you've tried it four times should make you more certain that the thing is most likely okay. Which brings it back to my idea of this fact-type database, right? You generated this fact, and two other sub-agents generated the same exact fact via different trajectories, but the fact is the same.
>> Mhm.
>> Theoretically speaking, you should take that into consideration, and put it in there in some shape or form.
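The fact-database idea could be sketched with SQLite standing in for the MySQL instance mentioned earlier, where a uniqueness constraint deduplicates facts that different sub-agents rediscover along different trajectories. The `store_fact` and `lookup_facts` helpers are hypothetical, not anything from the RLM paper:

```python
import sqlite3

# Hypothetical fact store the RLM could use from inside its REPL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (topic TEXT, fact TEXT, UNIQUE(topic, fact))")

def store_fact(topic: str, fact: str) -> None:
    # INSERT OR IGNORE deduplicates a fact rediscovered by another sub-agent.
    conn.execute("INSERT OR IGNORE INTO facts VALUES (?, ?)", (topic, fact))

def lookup_facts(topic: str) -> list[str]:
    rows = conn.execute("SELECT fact FROM facts WHERE topic = ?", (topic,))
    return [r[0] for r in rows]

store_fact("dataset", "raw logs span 2019-2024")
store_fact("dataset", "raw logs span 2019-2024")  # duplicate, ignored
print(lookup_facts("dataset"))  # → ['raw logs span 2019-2024']
```

A root model could then treat the number of independent trajectories that produced the same row as a confidence signal, instead of re-verifying from scratch.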
>> I wanted to ask about the REPL flow, because from what I understood it's not Jupyter; it's literally a straight-up REPL, where in order to output some text you need to print it.
>> Yes.
>> Am I correct here?
>> Exactly.
>> So have you thought about leaving it room to, I don't know, write markdown or something like that? I come back to the grad-student thing: if I had no room in my workflow to write down my thoughts about what I just saw, that would be a bit limiting. Yes, I'm going to engineer and do stuff, but at some point I'm not going to write print statements that contain my thoughts; I'd much rather switch to markdown and just write it out. What's your general interest here?
>> So the interface that the model interacts with is super important, like whether it's a REPL or a notebook with markdown that can also plot stuff. The caveat, though, is that technically the REPL environment can represent almost anything a Jupyter notebook can. For example, if you want to store markdown, you can store it in a variable. It's silly, it's not a natural thing to do, but it can do it. And when developing the paper, we decided we wanted it to be as simple as possible; a REPL is the simplest possible thing, so let's stick with that. But in the long term, say for people who want to use this in production or want to squeeze out performance, yeah, it's a good idea to store stuff in a Jupyter-style thing.
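The "store markdown in a variable" workaround is trivial to picture; a minimal sketch (the `notes` variable and `note` helper are my own invented names, not part of the RLM interface):

```python
# The REPL surfaces only printed output, so free-form markdown notes
# can accumulate in a plain variable instead of a notebook markdown cell.
notes = []


def note(md: str) -> None:
    """Append a markdown-formatted thought without printing it."""
    notes.append(md)


note("## Observations")
note("- Section 3 contradicts the abstract")
note("- Re-check table 2 before answering")

# The model can surface its running scratchpad whenever it chooses:
print("\n".join(notes))
```

Unnatural, as Alex says, but it shows why the REPL is expressive enough even without notebook cells.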
There's another reason we didn't use a Jupyter notebook: the REPL is just really easy to set up. In a Jupyter notebook you have to do a bunch of stuff; if you were to write a library for it, it's a little bit nasty. But the other advantage of a notebook is that you can print out images and plot things. And there was a question earlier about multimodal stuff; the answer is yes, you can actually do multimodal stuff. The problem in its current form is that we pass everything around as text, so we have no way of passing around images, but it's a really easy change in the code. I think one of the open research directions, if people are interested, is multimodal RLMs, or looking at RLMs in multimodal settings. The reason this is even cooler is that code interacting with images is a very underexplored thing: how a model can interact with images, or even with non-image stuff, like generating plots. GPT-5 has some tools that let it do this, but doing it in a more principled way is a super interesting topic. I am planning on adding support for this in the RLM library, and if anyone wants to add it themselves, feel free. The representation matters a ton, and this comes down to what is considered in distribution and what is not. These are all important things.
>> Yeah, that was my thought, because I have had this grad-student image in my mind.
>> My thought was: okay, what does in-distribution data look like for analysis like this on long stuff, where you have to do some sort of mini-analysis? It looks like you have to programmatically interact with the substrate, but then you have to think about it, write your thoughts, and those become the anchors you use for the next step, and you keep doing this, so that when you hand it to somebody, they just read your thoughts, which are a kind of summarization of all the code that is happening. And the models have inherently seen these Jupyter notebooks and seen this structure, so maybe they would be pushed toward the same kind of analysis behavior by being able to do that. I also saw some research, from Microsoft I think, on enhancing LLM data-analysis capability with notebooks at inference time; they're doing Monte Carlo search stuff. Basically they literally trained the model to do data analysis with Jupyter-style tooling, and it seems to be working well. So I don't know, it just sparked this thought in general.
Okay. So you did some ablations without sub-calls, and in some cases the RLM without sub-calls is able to perform better than the one that can do sub-calls. What's the issue here? Is it that the RLM doesn't really know when it should be doing one or the other? What do you think?
>> Yeah, I think it's a mix of things. One of them being, of course, like you said, that it makes suboptimal decisions with the recursive sub-call. This is also another reason why it's very important to add the baseline you talked about before to the final version of the paper, because we do want to see what happens if you strip out the two most important parts, or independently strip out one of them. I think one of the big points of that ablation is that a really important part of this paper is not actually the recursion part, which is funny because that's the name; it is really about offloading the context somewhere else. That is really, really important. Another big part, sadly, is noise. The annoying thing, and I will always criticize my own papers in this way, is that I don't have standard-deviation bars and such; sadly, I just cannot afford to run them. So in a lot of these instances it is likely also due to some kind of noise, which applies even to comparisons against the other baselines. But generally, yeah: suboptimal decision-making, noise, and also the fact that on the benchmarks where it performs worse, it can kind of get away with not using the recursive calls, because those tasks are not very information dense.
>> True.
>> So it can just find the thing it needs, and then the main model can reason through what that information is; it doesn't need to do the sub-calls. And that's another explanation for why there's a bigger gap on the others, like OOLONG and OOLONG-Pairs. But yeah.
>> That makes a lot of sense.
>> Yeah.
>> Okay. So I'm going to spare you the database question.
>> I mean, if you ask me, I'm happy to answer it too.
>> But you just need to test it; I mean, how can we know? And I think this also adds some complexity to the system, which aligns with the other question: do you think this could act as a replacement for a full-blown RAG system, if we push it to the extreme?
>> I don't think so. The reason I say this is that the usefulness of RAG and other retrieval methods, or a big part of it, is that you pre-index the things you're searching for, which is not cheap, right? And that's a big reason why we actually don't compare to RAG. Also, in our baselines RAG just doesn't even make sense; the only setting where it makes sense is BrowseComp-Plus, and in their paper they actually do RAG and it doesn't do that well compared to BM25, so it just wasn't even worth doing. But I still think there's value in methods that pre-index stuff. There is also value in equipping RLMs with tool calls, and in equipping them with RAG as an extra thing. So in that way I think RAG, or retrieval methods in general, are still very relevant in specific settings. The settings where RLMs really shine are where you cannot afford to pre-index, or you're just given something new on the spot, which often happens in a long agentic trajectory or something. But one of the things I do want to explore in the future is a task where the long-context part doesn't actually come from the prompt; it comes from the trajectory itself. So you can imagine, let's say we have a really hard retrieval problem where you need to piece together everything; I think BrowseComp-Plus is an example of this, but maybe even harder, like these deep-research-style things. The model is given a retriever, some kind of BM25 or RAG thing, and so is the RLM. The difficult part is that as it retrieves more stuff, the trajectory gets really long, and an RLM is actually very well suited for this setting. This is something I think is really interesting to explore, and it goes back to this idea of replacing a basic LM call with an RLM in your system and seeing what happens. But yeah.
>> Yeah, okay, that makes a lot of sense. I also do think that RAG still makes sense, unless you equip this thing with the database and the RAG too. The difference is that with RAG you have a one-shot type of situation, but in this case it's actually mining for the information, which is the most interesting part. Whether you add RAG, or tool calls, or whatever library you want into this system, literally allow it to browse the internet and send an agent to browse whatever, this is just adding onto the same core, a bit like how chain of thought that also does tool calling and goes off and does other stuff is adding onto the same kind of core.
>> Yeah.
>> There's an element here that was interesting: passing recursive LM output through a variable for long-output tasks. So if I understand correctly, it's offloading this to a sub-agent, right? The sub-agent does a bunch of stuff, it has a prompt in it or whatever, and the result goes into a variable; the root model isn't looking at what's inside.
>> Yeah.
>> It uses that to do the rest of the stuff, so it's saving a bit of its context. It's kind of trusting that this is fine.
>> Yes.
>> Right. This is literally what's happening here.
>> Yeah, and you can imagine this is actually a really cool part of the approach that I think is highly underrated. Actually, Prime Intellect's implementation of RLMs doesn't even allow the model to produce a final answer directly; it has to output a variable, and that string is the final answer of the RLM. That is an extreme version of what's being described here. The point of this part of the paper is that another large limitation of large language models is their output context window, which doesn't get talked about a lot; it's not infinite either. And one of the really cool things you can do with an RLM is output nearly unbounded sequence lengths, and you can do this in various ways. The trick we use is basically that the model can pick a variable and choose that variable as its actual final output. In the silliest case, you can imagine what it does: you give the RLM a prompt, it takes the prompt as a variable, passes it to a recursive model, the recursive model answers it and stores the answer in a variable, and the RLM just outputs that variable. That is the same as doing a plain model call; these are equivalent, right? But the more powerful part is that, for example, if your task is "I have a one-trillion-token Excel sheet and I want you to transform every row into a new Excel sheet," the RLM can actually do this. Basically, it would chunk up the Excel sheet, spawn a recursive model on each chunk, save each output to a variable, concatenate all the output variables into one final one, which is maybe also a trillion tokens, and output that. And this is very cool, because it also mixes in programmatic things: you don't have to use the language model itself to produce the final answer.
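Both cases described here, the degenerate single-call case and the chunk-and-concatenate case, can be sketched with a stubbed model call. The `llm` stub and all names below are hypothetical, not the paper's actual API; the point is only the shape of the data flow:

```python
def llm(prompt: str) -> str:
    """Stand-in for a recursive sub-model call (hypothetical stub)."""
    return prompt.upper()  # placeholder "transformation"


def rlm_degenerate(prompt: str) -> str:
    # Silliest case: one recursive call, the answer lands in a variable,
    # and that variable is chosen as the RLM's final output. This is
    # exactly equivalent to a plain model call.
    answer = llm(prompt)
    final_output = answer
    return final_output


def rlm_map(rows: list, chunk_size: int = 2) -> list:
    # More powerful case: the huge input lives in the environment, never
    # in the root model's context. Chunk it, spawn a recursive model per
    # chunk, save each output to a variable, and concatenate everything
    # programmatically into the final output variable.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    outputs = [[llm(row) for row in chunk] for chunk in chunks]
    return [row for chunk in outputs for row in chunk]


print(rlm_degenerate("hello"))                    # HELLO
print(rlm_map(["alpha,1", "beta,2", "gamma,3"]))  # ['ALPHA,1', 'BETA,2', 'GAMMA,3']
```

The output size here is bounded only by the environment, not by the root model's window, which is the whole trick.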
This feature is actually what broke a lot of the benchmarks, because the model is so flexible. For example, all of these benchmarks that ask whether a model can do 30-digit multiplication, which by the way I think is kind of silly; sometimes the answer is no, and it's like, why are we even evaluating this? In this setting it will just compute it in a variable and output that. And I think you can do a lot of really cool things: what the model can effectively do in the REPL is form an entire workflow for how it's going to generate the final answer, and this includes both code and language model calls. So it's almost building its own agent scaffold inside itself. It's a very interesting thing. And it's part of the reason why OOLONG-Pairs is so hard: OOLONG-Pairs asks the model to generate all the pairs that satisfy some property, and you can do this programmatically by passing the outputs across variables. So yeah.
>> I feel it's the right idea at the right time, because we already know that, I don't know, 4.5 is a fantastic coding agent, whatever. Well, fantastic: now you put it into a setup where the only thing it has to do is code, literally. Okay, it can't do the 30-digit multiplication stuff; it can write the script, run it, and then it's done, and you can just move on to the other task and stitch it up. It doesn't have massive context, doesn't matter; it can spawn six versions of itself and just go and run.
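Both failure-mode benchmarks mentioned above collapse to a few lines once the model is allowed to answer with code. A sketch (the "sums to an even number" property is an invented stand-in for whatever property an OOLONG-Pairs-style task would actually specify):

```python
from itertools import combinations

# Exact 30-digit multiplication: Python integers are arbitrary
# precision, so the exact product lands in a variable, and that
# variable can be chosen as the final output.
a = 123456789012345678901234567890
b = 987654321098765432109876543210
product = a * b

# Pairs-style task: enumerate every pair satisfying some property
# programmatically instead of generating them token by token.
items = [3, 7, 10, 14, 21]
pairs = [(x, y) for x, y in combinations(items, 2) if (x + y) % 2 == 0]

print(len(str(product)))  # digit count of the exact product
print(pairs)
```

No amount of token-space reasoning reliably gets the 60-digit product right; one line of code always does.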
>> So I feel like it's the right idea at the right time. There's this line I just wanted to get your rough idea on; I know you might be working on this right now: "We hypothesize that RLM trajectories can be used as a form of reasoning, which can be trained by bootstrapping existing frontier models." What would the workflow be here, do you think, in order to actually do that?
>> Yeah, so this is actually really tricky in practice. But the core idea of what I was trying to say there, and I want to say this in a way that's not confusing to people, so if people find it confusing I can reframe it, is this. In the last year, the way that we have done reasoning models: what is a reasoning model? A reasoning model is just a model that has been post-trained such that when it's given a question, it will output this long reasoning trace, and the trace also gets fed back into the model, so it's a form of conditioning. Given this reasoning trace it came up with, plus the original prompt, it will come up with a better, more informed answer to what it was trying to do. This is what I like to call reasoning in token space, because quite literally it is just outputting tokens to come up with an answer. And the way these were trained, with RL for example, although it doesn't have to be, is that you basically do this kind of version of rejection sampling, I don't know if that's the right word: you get the model to produce these long sequences, and if it gets the question correct, you give it a positive signal and do the update, yada yada. It's simple, although in practice there are a lot of really nasty parts. But the reason this works so well is that it's still the same as just training a model; you're just training a model with RL, and the sequence is still fed back into the model, so the whole pipeline is the same as if you were to train it in a non-reasoning way. There's actually no difference for the most part. The difficulty with the RLM part, which is why I think this is also so cool, is that the RLM trajectory is way longer than what fits into the model's context window. So you can't just naively train it the way you would: the backpropagation is really awkward here, and even the reward is really awkward here. We now have what we usually call in the RL community a credit assignment problem. The other really weird thing is that we're not reasoning purely in token space anymore; if that's a confusing term, I can explain it a little better, but we are reasoning in code and in token space. And not only that, we're reasoning across multiple model calls, which is really weird, a really awkward thing to do. So how you actually train this model such that it never uses the full trajectory for this reasoning-training thing is kind of tricky. It's an ongoing thing; we're looking into it. I would also be happy if, you know, frontier labs are interested in this as well. I don't care who ends up having the best model; I just want to see if it works.
>> It could be you, man. It could be you.
>> Maybe. But yeah, I guess that's kind of what that means.
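The token-space recipe described above, sample long traces and keep or reward only those that reach a correct answer, can be sketched as a tiny rejection-sampling loop. The `model_sample` stub below is entirely my own toy stand-in, not any real training code:

```python
import random

random.seed(0)


def model_sample(question: str) -> tuple:
    """Toy stand-in for a model emitting a reasoning trace plus an answer."""
    guess = random.choice([3, 4, 5])
    return f"<trace for {question!r}, guessed {guess}>", guess


def rejection_sample(question: str, gold: int, k: int = 16) -> list:
    # Sample k full traces; keep only those ending in the correct answer.
    # A real pipeline then fine-tunes (or applies an RL update) on the
    # kept traces, reinforcing whatever reasoning led to a correct answer.
    kept = []
    for _ in range(k):
        trace, answer = model_sample(question)
        if answer == gold:  # positive signal only when correct
            kept.append(trace)
    return kept


kept = rejection_sample("1 + 3 = ?", gold=4)
print(f"{len(kept)} of 16 traces kept for training")
```

The RLM difficulty Alex describes is exactly that `trace` no longer fits in one context window and spans multiple model calls, so this simple keep-and-train loop no longer applies directly.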
>> Yeah. Have you thought about evolutionary strategies here? I say this because I was talking to the EGGROLL guy, and another researcher who is also working on this, and it's comparable to doing GRPO on some of the benchmarks. You just have to make sure you're doing it as optimized as possible on the GPU side. But if you can pull it off, then it doesn't matter what's happening in the middle: it can literally spawn a hundred sub-agents, recursively, whatever. You just wiggle the stuff right here, then look at the output and say, this is good, we're going to make more of this and less of the other stuff.
>> Yeah.
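The "wiggle the stuff, look only at the output" loop is vanilla evolution strategies; here is its simplest one-dimensional form with a toy objective (a real RLM setup would score entire trajectories the same black-box way, which is precisely why the middle doesn't matter to the optimizer):

```python
import random

random.seed(0)


def fitness(theta: float) -> float:
    # Black-box score of the *final output* only; whatever happens in
    # the middle (sub-agents, recursion, code) is invisible to it.
    return -(theta - 3.0) ** 2


def es_step(theta: float, pop: int = 50, sigma: float = 0.1,
            lr: float = 0.05) -> float:
    # Vanilla ES: perturb the parameter, score each perturbation,
    # and move toward whatever scored well.
    grad = 0.0
    for _ in range(pop):
        eps = random.gauss(0.0, 1.0)
        grad += fitness(theta + sigma * eps) * eps
    return theta + lr * grad / (pop * sigma)


theta = 0.0
for _ in range(300):
    theta = es_step(theta)
print(round(theta, 2))  # drifts toward the optimum at 3.0
```

The credit-assignment problem disappears by construction, at the cost of much noisier gradient estimates, which is the trade-off the EGGROLL-style comparisons against GRPO are probing.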
One question I had: do you think post-training the model to be an RLM will have an impact on the number of sub-agents being spawned, and generally on its understanding of the task difficulty? Basically, bringing Qwen3 closer to GPT-5's level of understanding, of not being silly.
>> Yeah. So I would recommend reading this paper called Context Folding something-something; I think it's a ByteDance paper. They do something a little different from what RLMs do, but it's a similar core idea: we have multiple model calls and we want to train a model with RL to be able to do this kind of thing. And they do a lot of really interesting tricks to the GRPO loss, with the goal of reducing the number of sub-agent calls, reducing the length of the root language model's trajectory, stuff like that. My answer is that it honestly depends a lot on what your loss is. And, this is just speculation, but I think that naively training with how we've done it in the past, like with GRPO or maybe some other modified version, is not going to work that well unless you have a lot of data. I think we are inevitably going to need to bake in some things, at least in the beginning. In the future things will eventually just simplify out, and maybe it will just return back to GRPO, I don't know. But for the time being, if we want to see some initial cool results with RLMs, we'll probably have to guide them in certain ways. For example, if we want to post-train Qwen3, we kind of have to add a little knob that says, "Hey, don't do that many sub-agent calls." I think this is just what's going to happen.
>> Yeah.
>> Yeah, that makes a lot of sense. And I also had this other thought, but I think it goes in the same direction: if it knows that the task is easy or hard, it can decide what type of sub-agent it will call, and just use less compute. So if it knows this is a dumb-dumb task it just doesn't want to do, right, it can just spawn a Llama 3 and then it's done. But if it knows it has to do a big thing here: GPT-5.2, go for it, we're going to wait 20 minutes, it doesn't matter, because this is too complicated a problem and that thing can then orchestrate the rest. But here we don't know. There's a funny question in the chat: how does someone enjoy doing this all the time? Are you just sitting in front of your computer all day? How is this fun? What's your take on this?
>> Yeah. Well, what I will say is, I actually don't think I work that much. And this might be surprising to some people, but I think the hardest I've ever worked was during my undergrad. I genuinely think that, and I joke about it a lot with my friends. School can be hard; you can really make it hard for yourself. But for me personally, doing that was actually really helpful, because I spent most of my time in undergrad not doing deep learning or machine learning stuff. Other than some research, the courses I took were mostly math, systems, or physics-type stuff. And that honestly built up enough of a foundation for me to explore really simple ideas. I think the RLM idea, for example, is really simple; I don't think it's some crazy novel thing, although I do think it's quite clever, and that's why I think it's a little popular now. But in general, honestly, I am not of the opinion that people should work, you know, 15 hours a day; I think that's kind of crazy. I do think that if you enjoy it, you will naturally just spend time doing these things. And I also think I've been very lucky, so I will say that too; things work out differently for different people. But for me, I would credit my successful streak of research ideas to when I started doing GPU MODE stuff, which is really weird, because it's not related to most of the research I do. But that's when I started getting involved and seeing what kinds of problems exist out there. So yeah, I think a lot of problems in ML are still, not low-hanging fruit, but more like there are a lot of clever ideas that haven't really been articulated very well. And the funny thing about this AI stuff in general, I think, is that we don't need crazy ideas. A lot of ideas already exist and float around, but the way they become interesting is when somebody formalizes or articulates them in a way where people understand what's going on. I think STaR and Quiet-STaR, Eric Zelikman's work that underpins all the reasoning-model stuff, is a great example of this. The idea of bootstrapping reasoning traces: I'm sure many people thought of it at the same time or even earlier, but it's his papers that made it clear to people that this is actually a really good idea. And there are a lot of ideas like that still out there. Some of them are rooted in more theoretically minded individuals, people who like to think in math, and some of them are just super simple. So yeah, I don't think there's any secret recipe for these kinds of things. It really is just: you spend time in the field, and these ideas kind of float around.
>> Yeah.
>> Yeah. And also, for those who are not aware of how research works, it's not necessarily that you just sit at the computer, look at the computer, and the idea will come from the computer. The computer is just for doing, or for getting information, right? For the idea, you need to get an intuition and then start to read a bunch of stuff. You can take it outside; you can just print your things and start to read them out there. You chat with researchers; in my view, chatting with the researchers is the best way to get to the core of it. You can read the paper, yes, it's all formal and such, but getting the background intuition also gives you some sense of where the stuff is maybe going. So actually it's just a lot of chatting around, and at some point you have to code something.
>> Yeah. I mean, fundamentals are always super important, obviously, and honestly, this is a hot take, but I think the fundamentals for the AI field are quite shallow. If you wanted to get into pure math or physics research, it's quite difficult; it takes years. But in AI there's so much to do, and a good idea...
>> You can just verify it. Also verify your results.
>> Like, okay, you have this dumbass idea: you could just try it out, man, and then you'll see if it's worth it or not.
>> If you need a massive amount of compute and it's going to be super complicated, realistically it won't happen at all, right? So you just have to gravitate toward less compute-insane ideas, or get an internship at, like, OpenAI.
>> I actually think a lot of ideas don't require compute. The funny thing is, I think all of the boring ideas require compute, in the sense that it's the easy way out: "Well, we can just train on this thing," right? But take RLMs, for example: it does require compute, but the core part actually doesn't. There are really a lot of things that are missing currently, and these ideas can come from anywhere, genuinely. You don't need to be super established or anything like that.
>> Yeah.
>> Well, there's a good follow-up question on this; I think that will be the last one. How would one look for novelty when looking for research you want to publish? Somebody is doing their master's thesis: how do you get novel ideas?
>> So I think novel ideas come from understanding what's going on in the field really well. And that doesn't necessarily mean reading a thousand papers. There are some people who do that, and that's great; I used to do that, but I don't anymore. Part of the reason is that I think a lot of ideas get recycled, and I don't think that's a bad thing, by the way. But the way to think about this is that once you read into a field enough, you will get frustrated with certain things. Some things will just feel like: this doesn't make sense, why is it done this way? And honestly, the answer to that is usually that maybe someone hasn't explored it thoroughly. It's not that it's this way because it's the best; that's generally not true. As for what to pick: if you look at my history of research, it's all over the place, genuinely a bunch of random stuff. I am not a specialist in post-training, for example, or a specialist in context engineering. But it just tends to be that as you read into certain fields, you will naturally have questions. Like, Yacine, you had lots of questions for me today, and honestly a lot of those questions are research projects of their own; they could very well be things to explore.
And oftentimes it works like how this RLM project started. I should give credit to my advisor Omar, a very great guy; he's the one who did DSPy. He basically said: hey, what if we look at models that basically tool-call other models? I just don't know what would happen; let's see what happens. And initially we started doing a bunch of stuff and it did not work. It was very silly, very dumb. And I'm sure a lot of people have tried it too, before we settled on the final idea. But it's things like that.
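(Editor's note: the "models that tool-call other models" idea from this anecdote can be sketched minimally. Everything below is a hypothetical stub for illustration, not the actual RLM code: a root model delegates chunks of a huge context to cheap sub-model calls and only ever reasons over their short outputs.)

```python
# Minimal sketch of "models that tool-call other models".
# sub_model and root_model are hypothetical stubs, not a real API.

def sub_model(query: str, context: str) -> str:
    # Stub for a smaller/cheaper model call over one chunk of context.
    return f"summary of {len(context)} chars for '{query}'"

def root_model(query: str, full_context: str, chunk_size: int = 1000) -> str:
    # The root model never sees the full context; it delegates chunks
    # to sub-model calls and reasons over their short outputs.
    chunks = [full_context[i:i + chunk_size]
              for i in range(0, len(full_context), chunk_size)]
    partials = [sub_model(query, c) for c in chunks]
    # Stand-in for a final synthesis call over the partial answers.
    return " | ".join(partials)

print(root_model("find the bug", "x" * 3500))
```

The point of the shape, not the stubs: the root model's own context stays small no matter how large `full_context` grows.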
It's just like: oh, why hasn't this been done before? And yeah, there are some subfields where this is a lot harder to do. Systems, for example: it's a lot harder to pull off, because generally in systems it's not as much a research question; it's more that someone needs to go do it, and you need to just learn how to do it. FlashAttention is a great example of this. But yeah, I think there are lots of great ideas to be discovered.
>> Yeah, that's exactly it, honestly.
Same here. I mean, at some point you just have to commit to one of the ideas and push it through. And it's really true: there are a lot of trains of thought that just stopped four years ago because the only person who worked on them graduated and is now working at McKinsey or whatever. Not every direction of the human knowledge frontier is being pushed at the same time.
>> So what's next for this research direction, and how can the community be involved here?
>> Yes, okay, so this is important. The obvious next direction is training, right? And I don't think this is something that can be done that easily in the open, unless there are communities with open compute, things like EleutherAI and other more centralized communities where they can train stuff. A few companies are working on this now. Just in general: can we solve a lot of the problems we talked about today through post-training, and can we get a model that can actually boost its own performance by post-training on the scaffold? Very interesting problem. I think we will likely see some results in the next six months, maybe even earlier since some people are already working on this. I think another big direction, which
is maybe the more open-source part I've been thinking about, is going back to this Jupyter notebook thing and, more broadly, what the actual interface is that we want to end up with. I say this because I think that to make progress on this problem as a community, there need to be some standards set. If everybody works on this concurrently with different ideas of what to do, it's just going to be a mess, because everything in ML is about being in distribution now, right? Let's be honest: at least for language models, it's about being in distribution and trying to mold things into a shape where the model likes what you give it. So thinking about how this is designed, for this open-source library that we have, is super important. Number two is this whole asynchrony thing. We want this to also be really fast, and so I can imagine that in the near future we might develop another type of inference engine specifically for RLMs: how it minimizes the longest depth of chained language model calls, how we design these systems to be used on your local server, and how we design the sandboxes. What is this REPL going to be equipped with?
What is it even going to be? Is it going to be a Docker container, or a Docker image that runs on your machine? Is it going to be a sandbox you hook up to Modal, or to your own kind of cluster? I think these are all open questions that do not involve a lot of compute and that can be discussed and solved in the open. And so I think these two things are super important. Number
three, which I forgot about, is evals. Going back all the way to the beginning, we need evals, and I don't even mean long-context evals; of course those are important, but this is genuinely a great one if you're looking for something to do. With SWE-bench, I was fortunate enough to be there when John and Carlos and Ofir were developing it, and genuinely it just came from: hey, this is a naturally occurring problem, can we get a model to solve it? I think we need more benchmarks like this, where models just don't do that well and which try to reflect realistic tasks. They will get hill-climbed, of course, but we need more diverse evals, things that reflect what we actually want these models to do, because I think that's probably the single most important driver for model progress these days.
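(Editor's note on the asynchrony point from direction two: the reason an RLM-specific inference engine would care about the longest depth of chained model calls is that sequential chaining makes latency scale with chain length, while fanning sub-calls out concurrently keeps the critical path at one call. A minimal sketch with a stubbed model call; the sleep stands in for a real inference request, and all names are hypothetical.)

```python
import asyncio
import time

async def call_model(prompt: str) -> str:
    # Stub for a sub-model call; a real RLM harness would hit an
    # inference server here. The sleep models per-call latency.
    await asyncio.sleep(0.1)
    return f"answer({prompt})"

async def chained(chunks: list[str]) -> list[str]:
    # Sequential chaining: critical path is N calls deep.
    out = []
    for c in chunks:
        out.append(await call_model(c))
    return out

async def fanned_out(chunks: list[str]) -> list[str]:
    # Fan-out: all sub-calls run concurrently, critical path is 1 call deep.
    return list(await asyncio.gather(*(call_model(c) for c in chunks)))

async def main() -> None:
    chunks = [f"chunk-{i}" for i in range(8)]
    t0 = time.perf_counter()
    await chained(chunks)
    t_chain = time.perf_counter() - t0
    t0 = time.perf_counter()
    await fanned_out(chunks)
    t_fan = time.perf_counter() - t0
    print(f"chained: {t_chain:.2f}s, fanned out: {t_fan:.2f}s")

asyncio.run(main())
```

With 8 stubbed calls at 0.1 s each, the chained version takes roughly 0.8 s and the fanned-out version roughly 0.1 s; an RLM-aware engine would try to schedule call graphs toward the second shape wherever sub-calls are independent.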
>> Yeah. Yeah. 100%.
>> No, I have nothing to add, man. Thank you very much, and also for staying so long afterward. Folks, go follow him on Twitter; all the links are in the description. Read the paper, it's a really good one too. And thank you very much, Alex, for coming.
>> Of course. Thank you so much.
>> Honestly, I really like this idea. I really do believe it has the same characteristics that we saw early on with reasoning models. I think there's a lot of stuff to do with it; the frontier here is kind of boundless.
So if you want to get involved, this is a very good shape of project to be involved in, because it doesn't require training. You don't train these models, right? You set up the harness and then you tweak a bunch of stuff. If you find this kind of shape, where there isn't a lot of demanding compute and it's about qualitatively understanding what's going on and thinking creatively about how to set things up, it's a good place to start. The code is actually open source on his GitHub; I'll put the link in the description. You can take a look at it, start to tinker with it, and you're going to have some ideas that you can then share with the community. And that's it for today, folks. I hope you enjoyed the
video, and don't forget to check out the RLM video from Neural Next for a more in-depth, hands-on practical exercise. Have a great day.