"I Invented the Transformer. Now I'm Replacing It."
By Machine Learning Street Talk
Summary
## Key takeaways
- **Transformer Inventor's Remorse**: Llion Jones, co-inventor of the Transformer, drastically reduced his research on them because the space is oversaturated, preferring to increase exploration in new directions. [00:12], [00:36]
- **Sakana's Research Freedom Philosophy**: Sakana AI protects researchers' freedom to pursue interesting ideas, inspired by Kenneth Stanley's 'Why Greatness Cannot Be Planned', countering pressures that narrow creativity as companies grow. [04:13], [05:10]
- **RNN-to-Transformer Redundancy Warning**: Endless tweaks to RNNs like LSTMs and GRUs became redundant after Transformers achieved dramatically better results (around 1.1 bits per character), suggesting current Transformer tweaks may waste time similarly. [10:45], [11:04]
- **Spiral Representation Failure**: Standard ReLU networks solve the spiral dataset with tiny piecewise linear boundaries and high accuracy but fail to represent or extrapolate the true spiral shape, unlike proper spiral representations. [19:23], [20:19]
- **CTM's Neuron Synchronization**: The Continuous Thought Machine rethinks neurons as small MLPs and uses synchronization, dot products of activation time series, as its representation, enabling biologically inspired dynamics over an internal thought dimension. [32:50], [39:42]
- **Natural Adaptive Compute Emerges**: The CTM induces adaptive computation without penalties: easy examples are solved in few steps, hard ones naturally use more time, unlike Transformers needing hacks, and it achieves near-perfect calibration. [38:30], [50:48]
Topics Covered
- Abandon Transformers for Adaptive Compute
- Freedom Fuels AI Breakthroughs
- Current Tweaks Waste Time Like RNNs
- New Architectures Need Crushing Superiority
- Synchronization Enables Natural Adaptation
Full Transcript
Despite the fact that I was involved in inventing the Transformer, no one's been working on them as long as I have, right, with maybe the exception of the other seven authors. So I actually made the decision earlier this year that I'm going to drastically reduce the amount of research that I'm doing specifically on the Transformer, because of the feeling that I have that it's an oversaturated space, right? It's not that there are no more interesting things to be done with them. And I'm going to make use of the opportunity to do something different, right? To actually turn up the amount of exploration that I'm doing in my research.
We just released the Continuous Thought Machine. It's a spotlight at NeurIPS 2025 this year. You should care about it because it has native adaptive compute. It's a new way of building a recurrent model that uses higher-level concepts for neurons, and synchronization as a representation, that lets us solve problems in ways that seem more human, by being biologically and nature inspired.
The atmosphere in AI research was actually quite different back during the Transformer years, because it doesn't feel like something similar could actually happen right now, because of the reduced amount of freedom that we have, right? The Transformer was very, very bottom up, right? It's not that somebody had this grand plan that came down from on high, that this is what we should be working on. It was a bunch of people talking over lunch, thinking about what the current problems are and how to solve them, and having the freedom to have, you know, literally months to dedicate to just trying this idea and having this new architecture fall out.
We've spent hundreds of millions of dollars. The biggest sort of evolution-based search is probably in the tens of thousands. We have all this compute. What happens? What happens if you scale up these search algorithms? I'm sure you'll find something interesting, you know, when someone eventually does bite that bullet and really scales up these evolutionary, sort of artificial-life, experiments. Because I pitched it in an environment where people were just going all in on this one technology, I got zero interest. So now I have my own company and I can pursue those directions.
This podcast is supported by Cyber Fund.
>> Hey folks, I'm Omar, product and design lead at Google DeepMind. We just launched a revamped vibe coding experience in AI Studio that lets you mix and match AI capabilities to turn your ideas into reality faster than ever. Just describe your app and Gemini will automatically wire up the right models and APIs for you. And if you need a spark, hit 'I'm feeling lucky' and we'll help you get started. Head to ai.studio/build to create your first app.
>> Tufa AI Labs is a research lab based in Zurich. They've got a team of amazing ML engineers and research scientists. They're doing some really cool stuff. If you look at their website, for example, you can see what their approach was for winning the ARC AGI competition which closed out a few months ago, and they are hiring amazing ML engineers and research scientists. They also care deeply about AI safety. So if any of that is a fit for you, please go to Tufa AI Labs and give it a go.
>> The audience will know I'm a huge fan of Kenneth Stanley's ideas. His book, Why Greatness Cannot Be Planned, changed my life. It was absolutely insane. And what he was speaking to is that we need to allow people to follow their own gradient of interest, unfettered by objectives and committees and so on, because that is how we do epistemic foraging. When you have too many agendas involved in the mix, you kind of end up with a gray goo and you don't discover, you know, interesting novelty and diversity. And I suppose that's basically the thesis of your company, Sakana: to lean into those ideas.
>> Yes, exactly. At the company, we're massive fans of that book. We're hoping to have him come and talk at our company next week, actually. And it's a philosophy that we do talk about internally, right? We have copies of the book, including the recent Japanese translation. As you know, as one of the co-founders, one of my main jobs, one of the main things that I have to keep doing for this company, is making sure that we protect the freedom that the researchers currently have, right? Because it's a privilege, really, that we have the resources to be able to do that. And inevitably, as I've seen happen, as the company grows, more and more pressure comes in and it narrows the freedom. But I think because, you know, we believe in this philosophy so strongly, I'm hoping that we can give people all the research freedom that we do now for as long as possible.
>> And what are those processes that curtail freedom as a company matures? I mean, how would you describe that?
>> It's great that there's never been so much interest and people and talent and resources and money in the industry, but unfortunately that just increases the amount of pressure people have in order to compete with all the other people working on it, trying to get the value out of this technology and make money.
And I think that's what just happens, right? As a startup, you have a feeling of, you know, excitement and trying something new. And right at the beginning, you have a bit of a runway, so you have the freedom to try different things. But inevitably, people start to ask for returns on their investments, or they're expecting you to churn out some product. And this just unfortunately reduces the creativity that researchers have, because, you know, the pressure to publish, or the pressure to create technology that's actually useful for the products that we have, goes up, and so the feeling of autonomy, I think, starts to go down.
But, you know, I literally tell people when they start working for the company: I want you to work on what you think is interesting and important, and I mean it.
>> There is, I mean, in YouTube there's a phenomenon called audience capture.
>> Right.
>> And I think there might be a phenomenon called technology capture, which is that in the early days of Google it was quite open-ended. And I mean, Transformers is now the ubiquitous backbone of all AI technology, and it's a huge achievement that you're involved in. But there's a similar story with OpenAI. They're now starting to see all of these commercialization opportunities. I mean, they're going to become LinkedIn. They're going to become an application platform. They're going to become a search platform. They're going to become a social network. And I guess this could happen to you guys, that there's a very strong chance, especially with your new paper that we're going to talk about today, this Continuous Thought Machine: it could be a revolutionary technology, but then it will become obvious how it could be commercialized, and that's how those pressures come in.
>> I like the audience capture analogy. I think there's definitely been some kind of capture by large language models, right? They worked so well that everyone wanted to work on them. And I'm really worried that we're kind of stuck in this local minimum now, right? And we sort of need to try to escape it. So, we spoke about the Transformers, but there's a time just before the Transformers that I'd like to talk about, because I think it's quite illustrative. So, of course, the main technology before Transformers was recurrent neural networks, right?
And there was a similar feeling, right? When recurrent neural networks came in and we discovered this new sort of sequence-to-sequence learning, that was also a massive breakthrough, right? The translation quality went up massively, right? Voice recognition quality went up massively. And there was a similar sort of feeling then of, okay, yes, we've found the technology and we just need to sort of perfect this technology. And back then my favorite task was character-level language modeling, right? So every time a new RNN-based character-level language modeling paper came out, I got quite excited, right? I'd want to quickly read the paper, like, okay, how did they get the improvements? But the papers were always just these slight modifications on the same architecture, right? It was LSTMs and GRUs, and maybe initializing with the identity matrix so that you could use the ReLU function, or maybe you put the gate in a different place, or you layer them in a slightly different way, or you had gating going upwards as well as sideways. And I remember one of my favorites was this hierarchical LSTM where it would actually decide to compute or not compute the different layers. And if you trained on Wikipedia and you looked at the structure of when it decided to compute or not compute, it kind of looked like the structure of the sentences was actually being picked up by the model. And I used to love that sort of stuff, right? But the improvements were always like 1.26 bits per character, 1.25 bits per character, 1.24. That was a result that was publishable, right? That was exciting.
But then, after the Transformer, the team that I went on to afterwards, we applied for the first time very deep, decoder-only Transformer models to language modeling, and we immediately got something like 1.1 bits per character, right? Something so good that people would actually come to our desk and politely tell us, I think you made an error, like a calculation error; do you think it's nats, not bits per character? And we're like, no, no, no, it really is the correct number. What struck me later is that all of a sudden all of that research, and to be clear, very good research, was suddenly made completely redundant.
>> Yes.
>> Right. All of those endless permutations to RNNs were suddenly, seemingly, a waste of time.
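For readers unfamiliar with the nats-versus-bits confusion in that anecdote, a minimal sketch of the conversion; the 0.76 figure below is purely illustrative, not a number from the episode:

```python
import math

def bits_per_character(avg_cross_entropy_nats: float) -> float:
    """Convert an average per-character cross-entropy from nats to bits.

    Frameworks typically report cross-entropy with the natural log (nats);
    the character-level language modelling literature reports bits per
    character (log base 2), hence the division by ln(2).
    """
    return avg_cross_entropy_nats / math.log(2)

# Example: a per-character loss of 0.76 nats is roughly 1.10 bits per character.
print(round(bits_per_character(0.76), 2))
```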
We're kind of in the situation right now where a lot of the papers are just taking the same architecture and making these endless different tweaks, like, you know, where to put the normalization layer, and slightly different ways of training them. And we might be wasting time in exactly the same way, right? I personally don't think we're done, right? I don't think that this is the final architecture and we just need to keep scaling up. There's some breakthrough that will occur at some point, and then it will once again become obvious that we're kind of wasting a lot of time right now.
>> Yeah. So we are a victim of our own success, and this is a basin of attraction; there are so many basins of attraction. Sarah Hooker spoke about the hardware lottery, and this is a kind of architecture lottery. It actually made me think of the agricultural revolution, which is that this kind of phase change happened, and all of the folks that had these skills that were so necessary, these diverse skills for living and surviving, they died out. And that's actually quite paradoxical, because we need those skills to take the next step.
>> Mm.
>> And so we're now in this regime. We've got the term foundation model, and the implication is that you can do anything with a foundation model. In the corporate world we used to have data scientists, you know, ML engineers doing these architectural tweaks, even in, you know, midsize enterprises, and now we just have AI engineers who are just doing prompt engineering and so on. So you're saying that the fundamental skills that we need, to be diverse, to think of new solutions and new architectures, they're dying out.
>> I think I'm going to disagree with that. I think the problem is we have plenty of very talented, very creative researchers out there, but they're not using their talents.
Right? For example, you know, if you're in academia, there's pressure to publish, right? And if there's pressure to publish, you think to yourself: okay, well, I have this really cool idea, but it might not work. It might be too weird, right? It might be difficult to get it accepted, because I'd have to sort of sell the idea more. Or I can just try this new positional embedding, right? The problem is that the current environment, both in academia and in companies, is not actually giving people the freedom that they need to do the research that they probably want to do.
>> I mean, there's also this interesting thing that even in spite of great new research (I was speaking to Sepp Hochreiter and he's got all of these new architectural ideas), OpenAI aren't implementing them. I mean, Google are doing this diffusion language model, which is quite cool. And I'd like to know your opinion on why that is. So there are a few philosophies floating around, like this concept of a universal representation, that there are universal patterns and the Transformer representations resemble those in the brain. And it's rather led to this idea of, well, we don't need to use different architectures, because if we just have more scale and more compute then all roads lead to Rome. So why would we bother doing it any differently?
>> There are actually better ones, right? There are actually already architectures that have been shown in the research to work better than Transformers. Okay, but not better enough to move the entire industry away from such an established architecture, where you're familiar with it. You know how to train it. You know how it works. You know how the internals work, right? You know how to fine-tune them. You have all this software already set up for training Transformers, fine-tuning Transformers, inference. So if you want to move the industry away from that, being better is not good enough. It has to be obviously, crushingly better. Transformers were that much better over RNNs. Okay? Transformers, where you just applied them to a new problem and it was just so much faster to train and you got such higher accuracy that you just had to move.
And I think the deep learning revolution was also another example of that, right? Where you had plenty of skeptics, and people were pushing neural networks even back then, and people were going, "No, we think symbolic stuff will work better." But then they demonstrated it as being so much better that you couldn't ignore it. This fact makes finding the next thing even harder, right? There's the gravitational pull, always pulling you back to: oh, okay, but a Transformer is good enough. And yeah, you made a cool little architecture over here that looks like it's got better accuracy, but OpenAI over here just made it ten times bigger and it beats that.
So, let's just keep going.
>> May I also submit that there could be an additional reason, which is, you know, I love that fractured entangled representations paper. There's this shortcut learning problem, and I think there's a little bit of a mirage going on here; there might be problems with these language models that we're not fully aware of. And there's also this thing that we're seeing, that we are starting to bastardize the architecture. So we know we need to have adaptive computation for reasoning. We know we want things like uncertainty quantification. And what we're doing is we're bolting these things on top, rather than having an architecture which intrinsically does all of these things that we know we need.
>> Yeah. And I think our Continuous Thought Machine is an attempt at addressing those more directly, right? Which Luke will be able to tell you more about later. There's something still not quite right with the current technology, right? I think the phrase that's becoming popular is jagged intelligence, right? The fact that you can ask an LLM something and it can solve literally, like, a PhD-level problem, and then, you know, in the next sentence it can say something just so clearly, obviously wrong that it's jarring, right? And I think this is actually a reflection of something probably quite fundamentally wrong with the current architecture. As amazing as they are, the current technology is actually too good. Okay?
That's another reason why it's difficult to move away from them, right? They're too good in the following sense. And you spoke about the fact that we have these foundation models; that's okay, so we have a foundation that we can do anything with. Yes, I think current neural networks are so powerful that, if you have enough patience and enough compute and enough data, you can make them do anything. But I don't necessarily think that they want to, right? We're sort of forcing them. They're universal approximators, but I think there is probably a space of, you know, function approximators that will more naturally want to represent things in the way that a human represents them. So there's actually quite an obscure paper that is my poster child for this. It's called Intelligent Matrix Exponentiation.
>> And I think it was actually rejected.
>> So, you know, you can probably project the image of figure one, but there's an image of it solving, you know, the classic spiral dataset, where you need to separate the two classes in the spiral.
>> Yes.
>> And it has the decision boundary for both a classic ReLU multi-layer perceptron and a tanh multi-layer perceptron. And you can see they both solve it, right? Technically, they both solve the problem, because they classify all the points correctly and get a very good test score on this very simple dataset. And then they show you the decision boundary for the M-layer that they built in this paper, and it's a spiral. The layer represented the spiral as a spiral. Shouldn't we, you know, if the data is a spiral, shouldn't we represent it as a spiral? And then if you look back at the decision boundaries for the classic ReLU multi-layer perceptron, it's clear that you just have these tiny little piecewise linear separations. And that's what I mean. Yes, if you train these things enough and you push these little piecewise linear boundaries around enough, it can fit the spiral and get a high accuracy. But there's no feeling, when I look at that image, that the ReLU version actually understands that it is a spiral, right? And when you represent it as a spiral, it actually extrapolates correctly, because the spiral just keeps going out.
>> You're touching on something fascinating there, because, you know, we were talking about the need for adaptivity and adaptive computation. I'm really inspired by Randall Balestriero's spline theory of neural networks, and we've had him on many times. You can look on the TensorFlow Playground; you can look at what happens when you have a ReLU network on this, you know, spiral manifold. And, you know, you'd be forgiven for thinking that these things are basically a locality-sensitive hashing table, right? Because they partition the space and they can predict the spiral manifold, right? But we want to do something a bit different from that. And it also comes into this impostor thing, because there's a big difference between just tracing the spiral manifold and continuing the pattern. From an impostor perspective, just tracing the pattern is not learning it abstractly or constructively, right? If we learned it constructively, so, you know, you speak about this in your paper, this complexification, the abstract building blocks, and you can do adaptive computation, you understand the spiral. That means that with adaptive computation you can continue the spiral, and then you can update the model's weights so it has adaptivity, because that's so important for intelligence. So we know that we need models that can do these things. But for some reason they're so sycophantic, they're almost better than an adaptive intelligent system, because they tell us exactly what we want to hear. They seem so intelligent, but we know that they're missing these fundamental properties.
>> I'm still fairly skeptical when I see video generation models. You know, we went through a phase where you could detect them because of the number of fingers on somebody's hand, right? And yes, with more data, with more compute, with better training tricks, okay, they fixed it, and now they usually do have five fingers. But did we fix the problem, or did we just use more brute force to, you know, force the neural network to know it's five fingers, versus something that actually had a much better kind of representation space? It's almost mad that it's controversial to say that we should represent a spiral like a spiral. But, you know, something that could do that generally, that if it represented a human hand the way that, you know, maybe I represent a human hand, then maybe it would be much easier to count how many fingers are on a hand. It's unfortunate that they work so well. It's unfortunate that scaling works so well, because it's too easy for people to just sweep these problems under the carpet.
>> You guys have possibly created what I think might be the best paper of the year. This could actually be the innovation which takes us to the next step. And did you get the spotlight at NeurIPS as well?
>> Yeah.
>> This year and congratulations on that.
So I think that's testament to how amazing this paper is.
>> The CTM, the Continuous Thought Machine. It's actually not that far outside of the local minimum that we're stuck in, right? It's not as if we went and found this completely new technology, right? We took quite a simple, biologically inspired idea, right, the fact that neurons synchronize, and not even necessarily in a biologically plausible way, right? Brains don't literally have all their neurons wired together in a way that they work out their synchronization.
But it's the sort of research that I want to encourage people to do. And the way to sell it is quite easy. I think at no point did we have to worry about being scooped, right? That stress was taken away from us completely. So there was no pressure to rush out with this idea, no feeling of, well, there's probably somebody else working on exactly this. And I think the reason that we were able to get a spotlight is because we were able to create such a polished paper. We took the time to do the science properly, to get the baselines that we wanted, and do all the tasks that we wanted to try. Encouraging researchers to take a little bit more of a risk, right, to try these slightly more speculative long-term ideas: the sad thing is, I don't think it's necessarily a very difficult thing to sell. And I want to have the CTM as a poster child of: it works, right? It was a bit of a risk. We didn't know if we were going to find something interesting, but it was our first shot, and we did find something interesting, and it became a successful paper.
>> If we do find a system which can acquire knowledge, design new architectures, do the open-ended type of science that you're speaking to, can you see a future where at some point the locus of progress will be mostly driven by the models themselves?
>> I think so. Whether or not that's going to replace us completely, I go back and forth on. Powerful algorithms are helping us do research, right? And I think it might just end up being a more powerful version of that, right? So, with the AI Scientist that we released, we showed that you could actually go end to end, right? Go from seeding the system with an idea for a research paper and then just take your hands off and let it go: think about the idea, write the code, run the code, collect the results, and write the paper. To the point that we were actually able to get a 100% AI-generated paper accepted to a workshop recently, right? But I think we did that to show that you could do it, as a sort of demonstration in a real system. I think I would want it to be much more interactive, right? I would want to be able to seed it with an idea and then have it come back with more ideas, have a discussion with me, then go away to write the code. I want to look at the code and check it, and then discuss the results as they're coming out. So that's the sort of near-term future that I would envision, or how I would like to do research with an AI.
>> And could you introspect on that? Is it because you feel we need supervision because the models don't yet understand? You know, there's this path dependence idea. So we need to do supervision because we have the path dependence, so we can guide the generation of the language models. Maybe in the future the language models will just understand better themselves. But there's also the output dimension, which is that we want to produce artifacts that extend the phylogeny of human interest. We want it to be human-relevant.
>> Yeah, I think it's more that, you know, in that initial seed idea, it's probably impossible to actually describe exactly what you want. It's exactly the same with, you know, when I have an intern. I can't just have an intern come into the company, say, "I have this mad idea," explain it to them, and then just leave them alone for four months. There's a back and forth, because I have a particular idea that I want to explore, and I need to keep steering them in the direction that I had in my mind originally. So I think it's more like that, basically.
>> You have such a deep understanding. So you have this rich provenance and history and path dependence, and that means you can take creative steps, intuitive steps, for you respect the phylogeny. They respect all of this deep abstract understanding that you have, and interns don't yet have that. But maybe AI models in the future will have that.
>> Yeah, sure. If they get to the point where my input becomes detrimental, then yeah, that'll be a thing. It's kind of like chess, right? There was a point at which a chess-engine-plus-human fusion actually beat chess engines. That's not true anymore, right? Adding a human into the mix actually makes the bots worse.
>> Oh, interesting. I wasn't aware of that.
>> Yeah. So what to do when that day comes for AI scientists is a broader discussion, I think.
>> I think now is a good segue to talk about this paper in a little bit more detail. So, this Continuous Thought Machine, you were just pointing to it before. Luke, first of all, mate, introduce yourself and set this thing up for us.
>> My name is Luke. I am a research scientist at Sakana AI, and my primary area of research is this Continuous Thought Machine. It took us somewhere in the region of about eight months working on this project with the whole team. I did a lot of the work, but we also had a lot of people in different areas doing different parts of it. I think an eight-month life cycle for a paper seems a bit long for AI research at the moment. But yes, to the actual technical points of the paper. So we call it the Continuous Thought Machine. It originally had a different name; we called it the Asynchronous Thought Machine, but every single time people asked us what the asynchronous part was, it became a bit confusing.
So the Continuous Thought Machine basically depends on three novelties. The first one is having what we call an internal thought dimension, and this is not necessarily something new; it's related conceptually to the ideas of latent reasoning. It's essentially applying compute in a sequential dimension. And when you start thinking about ideas and problems in this domain and in this framework, you start understanding that many problems that look intelligent, or solutions to problems that look intelligent, are often solutions that have a sequential nature. So, for instance, one of the primary tasks that we tested the Continuous Thought Machine on was this maze-solving task. And solving mazes for deep learning is quite trivial. It's really easy to do if you make the task easy for machines. And one of the ways to do this is you give an image of a maze to a neural network, like a convolutional neural network, and it outputs an image the same size as the maze, with zeros where there isn't a path and ones where there is a path. There's some really brilliant work showing how you can train these in a careful way and scale them up essentially indefinitely, and this is a fascinating and really interesting idea of how to solve this. However, when you take that approach out of the picture and you ask what is a more human way to solve this problem, it becomes a sequential problem. You have to say, well, go up, go right, go up, go left, whatever the case may be, to trace a route from start to finish. And when you constrain that simple problem space and you ask a machine learning system to solve it like that, it turns out to actually get much, much more challenging. So this became our hello world problem for the CTM, and applying an internal sequential thought dimension to this is how we went about solving it.
There are two other novelties that we can touch on and talk about. We sort of rethought the idea of what neurons should be. There is a lot of excellent research in this world, in cognitive neuroscience particularly, exploring how neurons work in biological systems. And then on the other side of the scale we have how deep learning neurons work, where the quintessential example is a ReLU: it's off or on, in a sense. And this very, very high-level abstraction of neurons in the brain feels a little bit myopic. So we approached this problem and said, well, let's, on a neuron-by-neuron basis, let this neuron be a little model itself. And this ended up doing a lot of interesting work on how to build dynamics in the system. The third novelty here is, as I said before, we have this internal dimension over which thinking happens. We ask the question: well, what is the representation? What is the representation for a biological system when it's thinking? Is it just the state of the neurons at any given time? Does that capture a thought, if you wish, if I can be controversial and use the terms thinking and thought? And my philosophy with this is no, it doesn't. The concept of a thought is something that exists over time. So how do we capture that, in engineering speak? Instead of measuring the states of the model that is recurrent, we measure how it synchronizes: how neurons synchronize in pairs along with other neurons. And this opens up the door to a huge array of things that we can do with this type of representation.
>> You were talking about this sort of sequential nature of reasoning, and, devil's advocate, I mean, there was that Anthropic biology paper, and they were talking about planning and thinking, and they were saying that this thing is planning ahead. Because I think your system, we can say, actually does planning, and it's actually different computationally. Can you explain that?
>> Yes, I think the boundary in terms of computation, from a Turing machine perspective, if you wish, is really interesting, because the notion of being able to write to your tape, read from a tape, and then write again, to be a Turing-complete system, is obviously an incredible idea that has completely changed the world. And I think the primary difference, let's talk about Transformers versus what we're trying to do with the CTM, is that the process that the CTM thinks in, we can apply that internal process to breaking down a problem. So there might be a single solution to the problem, and you could do it in one shot. You could, as I explained with the maze, just process that in one shot, but there are certain phrasings of problems, real problems, where doing so becomes exponentially more challenging. So in the maze task, a really good example is that if you try to predict 100 or 200 steps down the path in one shot, no models that we could train, not even our model, could do that.
And we needed to actually build an autocurriculum system where the model first predicted the first step, and then, when it could predict the first step, we started training it on the second and third and fourth steps. And the resultant behavior of this is where it gets interesting. One of the ways that I like to do research, and that I encourage people who work with me to do research, is to understand, if you wish, the behavior of a model. We're getting to a point now where the models that we build are demonstrably intelligent in ways that keep surprising us, and breaking that down into a single set of metrics, or even a finite single metric about performance, seems maybe not the right way to do it, for me. Understanding the behavior and the actions that those models take when you put them in a system and train them in a certain way seems to reveal more about what's actually going on under the hood.
>> Very cool. And I think I didn't pick up on this. So you're doing a fixed number of steps, so you have, like, a context window, and did you say that you've set that around 100 steps?
>> So for the maze task, the model always observes the full image at every step. The CTM will observe the full image; for argument's sake, those images could be tokens from the output of a language model, those inputs could be numbers that the model has to sort, whatever the case may be. It should be agnostic to data; that's how we've tried to build it. But in the maze task the model can continuously observe the data. No matter what, it can look at the whole image simultaneously, but it uses attention to retrieve information from the data, and it has, let's call it, 100 steps that it can think through. And what we do is, we pick up that at some point the model solves three steps through the maze. So it says, I'm going to go up, up, and right, and then it's correct, but then it makes the wrong turn. At that point, we stop supervision. We only train it to solve the fourth step, so one more than what it could. In practice, we do five, but the principle holds. And when you do that, it's a self-bootstrapping mechanism. And I think the intuitive listener will understand how that extends to other domains, other sequential domains, for instance language prediction, many tokens ahead, that sort of thing.
>> So I'm really interested in this idea of adaptive computation. I guess the first question is, how sensitive was the performance to the number of steps? Then the next question would be, could you have an arbitrary number of steps, which means that, you know, perhaps based on uncertainty or some kind of criterion you could do fewer steps? And then the final question is, could you have potentially an arbitrary or unbounded number of steps?
>> Yeah, really super question. I think I'll answer the uncertainty question first, about the sensitivity to steps. So a very good example of this is, we just trained the model on ImageNet classification, and our loss function is quite simple. What we do is we run it for, for example, 50 steps, and we pick out two points, two distinct points. The first one is where it is performing the best, i.e. where the loss is lowest, and the second one is where it is most sure, or where it is most certain. Those give us two indices between 0 and 49 inclusive, and we apply cross-entropy at both of those points; we just make the loss the average of the cross-entropy at those points. So what this does is it induces a behavior where easy examples are solved almost immediately, in one or two steps, whereas more challenging examples will naturally take more thinking, and it enables the model to use the full breadth of time that it has available to it in a natural fashion, without having to force it to happen.
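A rough sketch of the two-point loss as described, with assumed tensor shapes and certainty measured as negative predictive entropy (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def ctm_two_point_loss(logits_per_tick, targets):
    """Average the cross-entropy at the best-loss tick and the most-certain tick.

    logits_per_tick: (batch, ticks, num_classes) class logits at each internal tick
    targets:         (batch,) ground-truth class indices
    """
    b, t, c = logits_per_tick.shape
    flat = logits_per_tick.reshape(b * t, c)
    rep_targets = targets.repeat_interleave(t)
    losses = F.cross_entropy(flat, rep_targets, reduction="none").reshape(b, t)

    # Certainty proxied by negative entropy of the predictive distribution.
    probs = F.softmax(logits_per_tick, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)   # (batch, ticks)

    t1 = losses.argmin(dim=1)        # tick where the loss is lowest
    t2 = entropy.argmin(dim=1)       # tick where the model is most certain
    idx = torch.arange(b, device=logits_per_tick.device)
    return 0.5 * (losses[idx, t1] + losses[idx, t2]).mean()
```

Nothing here penalizes how many ticks get used, which is the point made later in the conversation: easy examples can become confident early, hard ones keep using ticks, and the adaptive behavior is induced rather than enforced.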
>> So you've decided to model every neuron as an MLP, which is really fascinating. Talk about that. But also there's this notion of synchronization, and I think you use the inner product to determine the extent to which the neurons are synchronized, and this kind of unfolds over time as the driving force. Can you explain that in a bit more detail?
>> Absolutely. I think it's a good point to explain the neuron-level models, as we call them in the paper, or NLMs, first, because it ties into this. So you can imagine a recurrent system is a state vector, a state vector that is being updated from step to step. We track that state vector, and that state vector unfolds, and for each individual neuron, each i-th neuron in the system, we have an unfolding time series. It's discrete in time, but continuous-valued. And those time series define what we call the activations over time. And synchronization is quite simply measuring the dot product between two of these time series. So you have a system of D neurons, and essentially you have roughly D squared over two different synchronization pairs. So neuron 1 can be related to neuron 2 by how they synchronize, and neuron 1 can also be related to neuron 3, etc., etc. The neuron-level models function by taking in a finite history, like a FIFO, of incoming activations, and instead of being just a ReLU activation, they use that history as information to process a single activation out, and that is what moves from what we call pre-activations to post-activations. And the principle here is, this might seem rather arbitrary, and does it help for performance? It turns out it does, but that's not really the catch-all solution here; that's not what we're after. What we're after here is trying to do something biologically plausible: find the line somewhere between biology, which is how the brain implements things in the biological substrate that we have, versus deep learning, which is highly parallelizable, super fast to learn, backprop-able, all of the nice properties that have got us this far, and find a line somewhere where we can take some sprinkling of biological inspiration but still train it with deep learning. And it turns out that neuron-level models are a nice interim that we can do this with. The concept of synchronization is applied on top of the outputs of those neuron-level models.
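A compact sketch of those two pieces, per-neuron MLPs over a short activation history and pairwise synchronization as dot products of activation time series; shapes and names are assumptions for illustration and are simplified relative to the released CTM code:

```python
import torch
import torch.nn as nn

class NeuronLevelModels(nn.Module):
    """Each neuron gets its own tiny MLP over its recent pre-activation history."""
    def __init__(self, num_neurons, history_len, hidden=16):
        super().__init__()
        # One small weight set per neuron, applied over that neuron's FIFO history.
        self.w1 = nn.Parameter(torch.randn(num_neurons, history_len, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(num_neurons, hidden))
        self.w2 = nn.Parameter(torch.randn(num_neurons, hidden) * 0.1)

    def forward(self, history):
        # history: (batch, num_neurons, history_len) of pre-activations
        h = torch.relu(torch.einsum("bnt,nth->bnh", history, self.w1) + self.b1)
        # post-activation per neuron: (batch, num_neurons)
        return torch.einsum("bnh,nh->bn", h, self.w2)

def synchronization(post_activations):
    # post_activations: (batch, ticks, num_neurons), collected over internal ticks.
    # Pairwise dot products of each neuron's time series give a D x D matrix;
    # the upper triangle (roughly D^2/2 entries) is the representation read out
    # by downstream heads.
    z = post_activations
    sync = torch.einsum("btn,btm->bnm", z, z)    # (batch, D, D)
    i, j = torch.triu_indices(z.shape[-1], z.shape[-1])
    return sync[:, i, j]                          # (batch, D*(D+1)/2)
```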
>> So, on this, on the scaling: I think the time complexity is quadratic with respect to the dimension of the synchronization matrix, right? And in your paper you were talking about subsampling to improve the performance, but how did that affect the stability? Were there any things that doing that cost you?
>> Yeah, it's a neat question. I think, in terms of stability, what we found was kind of fun, and this was a sentiment that we had throughout the experiments that we ran with this paper: no matter what we tried it on, it just kind of worked, with all spreads of hyperparameters. With backprop through time with recurrent models like RNNs and LSTMs, it's typically a challenge: you run for many internal ticks and the learning seems to break down. But the fact that we use synchronization in some sense touches all of the neurons through all of the time, so it really helps with gradient propagation. A nice, interesting point that's maybe a bit oblique to what you asked about synchronization is that we have a system of D neurons and, like I said earlier, there are roughly D squared over two possible combinations. This essentially means that our underlying state, or underlying representation, of the system is quite a lot larger than what you would get by just taking those D neurons. And as to what that means in terms of downstream computation and performance, and the things that we can do with this, that is what we're actively exploring right now.
>> You guys used an exponential decay rate.
>> You have the system that unfolds over time. It would be maybe a little bit too constrained if the synchronization between any two neurons depended on the same time scale. So, for instance, there are neurons in your brain that are firing over very long time scales and very short time scales. The way that they fire together impacts other neurons and causes those neurons to fire. But everything in biological brains happens at diverse time scales; it's why we have different brain waves for different thinking states, for instance. But beside that point, what we do with the exponential decay in the Continuous Thought Machine is it allows us, with a very sharp decay, to say that for these two neurons that are pairing together, what really matters is only how they fire together right now. Right? But if we had a very long and slow decay, essentially that's capturing a global sense of how those neurons are firing over an extremely long period of time. So this was essentially a way of capturing this idea that different neurons could maybe fire together very quickly and other neurons can fire together very slowly, or not at all. And this lets that representation space that I spoke about, that D-squared-over-two representation space, again become richer, and we can enrich that space with more subtle tweaks to how we compute those representations.
>> So we were speaking about this yesterday, Luke, that when folks apply Transformers to things like the ARC challenge, or things that need reasoning, we need to do lots of domain-specific hacks. So the ARChitects, who were the winners of last year's challenge, did depth-first-search sampling, and some folks have been experimenting with using language representations, or, you know, using DSLs. Some part of this is to do with the reachability of language, right, and language is quite dense, which means you can kind of monotonically increase. But if I understand correctly, your system might have some interesting properties for reasoning, for discrete and sparse domains, and also for sample efficiency, because we want to build a system that can actually do well on things like the ARC challenge. But can you explain, in simple terms, why you think this architecture could be significantly better than Transformers for doing those things?
>> I think a lot of the really fascinating work in the last few years, in the language model literature, has been related to what one could actually call a new scaling dimension. I, in some sense, see chain-of-thought reasoning as a way of adding more compute to a system. That's obviously just one small part of what that really is and what that really means, but I think it's quite a profound breakthrough in some sense. Now, what we're trying to do is have that reasoning component be entirely internal, yet still running in some sort of sequential manner, and I think that's rather important. And you spoke earlier about Gemini's diffusion language modeling, and I think there are a lot of different directions exploring this right now. I do think that the Continuous Thought Machine, with the ideas of synchronization and multi-hierarchical temporal representations, gives a certain flexibility on that space that other people are not yet exploring. And the richness of that space, being able to project the next step to solve the ARC challenge, and the next 100, the next 200 steps, being able to break that down into a process that a model can then very quickly search in its high-dimensional latent space, becomes something that feels like a good approach to take.
>> Do you see any relationship between this architecture and, you know, Alex Graves' Neural Turing Machine?
>> Yes, that's really interesting. I do. I think that one of the most challenging parts about working with a Neural Turing Machine is the concept of writing to memory and reading from memory, because it is a discrete action, and that has its own challenges associated with it. And yes, I wouldn't go so far as to say that the Continuous Thought Machine is definitively Turing complete, but there is the notion of doing reasoning in a space that is latent, and letting that space unfold in a way that is rich towards a different set of tasks.
And this actually brings me to a point that I find quite interesting that I'd like to share with you. Consider again the ImageNet task, or any sort of classification task; it's a nice test bed. There are many images that are really easy and there are many images that are really difficult. When we train, for instance, a ViT or a CNN to do this task, it has to nest all of that reasoning in the same space. It has to put all of its decision-making process, for a very simple, obvious cat versus some complex, weird, underrepresented class in that dataset, and it has to nest it all in parallel, in a way where we get to the last layer and then we classify. I think breaking that down, where you have different points in time where you can say, now I'm done, I can stop, versus now I'm done, I can stop, lets you take a dataset, or take a task, and actually naturally segment it into its easy-to-difficult components. And I think we know that curriculum learning, and learning in this continuous sense, again seems to be a good idea; it's how humans learn. And if we can get at that architecturally, and just have that fall out of a model, again, this seems like something worth exploring. I'm not sure if you know much about model calibration, and how neural networks tend to be poorly calibrated.
>> Oh, go for it.
>> It's a bit of an old finding, but if you train a neural network for long enough, and it fits really, really well, and you've regularized it really well, you'll find that the model is miscalibrated, which essentially means that it is very certain about some classes where it's wrong, and uncertain for some classes where it's correct. Essentially, what you want for a perfectly calibrated model is: if it predicts that this is the correct class with 50% probability, then 50% of the time you want it to be correct about that class, and so on and so forth. So for a well-calibrated model, if it's predicting a probability of 0.9 that it is a cat, then 90% of the time it should be correct. And it actually turns out that most models that you train for long enough get poorly calibrated, and there are loads of post hoc tricks for fixing this. We measured the calibration of the CTM after training, and it was nearly perfectly calibrated, which is again a little bit of a smoking gun that this actually seems to be probably a better way to do things.
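For readers unfamiliar with how such a calibration claim is usually quantified, a small generic sketch of expected calibration error (ECE); the binning scheme and names are standard practice, not taken from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Expected calibration error over equal-width confidence bins.

    confidences: (N,) max predicted probability per example
    correct:     (N,) 1 if the argmax prediction was right, else 0
    A perfectly calibrated model has accuracy == confidence in every bin (ECE = 0).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by fraction of samples in the bin
    return ece
```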
The flavor of this kind of research is such that we didn't actually go out and try to create a very well-calibrated model, right? And we didn't even try to create a model that was necessarily going to be able to do some kind of adaptive computation time, right? I was a very big fan of that paper, Adaptive Computation Time; that was Alex Graves, was it? But that paper had a massive amount of hyperparameter sweeps in it, because in that paper he needed to have a loss on the amount of computation that was being done.
>> Because any time you try to do some sort of adaptive computation time research, what you're fighting is the fact that neural networks are greedy, right? Because obviously the way to get the lowest loss is to use all the computation that you have access to. So unless you had an extra loss with a penalty that said, okay, actually, you're not allowed to use all that computation, a very, very carefully balanced loss, that's when you actually got the interesting dynamic computation time behavior falling out of the model in that paper. But what was really gratifying to see with the Continuous Thought Machine is that, because of the way that we set up the loss that Luke described earlier, adaptive computation time seemed to just fall out naturally. So that's more the way that I think research should go.
>> Okay? Because we don't actually have, like, a specific goal, or a specific problem we're trying to fix, or something we're trying to invent. It's more that we have this interesting architecture and we're just following the gradients of interestingness.
>> Yes. And on that point, I think maybe the most exciting thing about your paper is, you know, we were talking about path dependence and having this understanding which is built step by step, this process of complexification. And maybe this is apropos in the theme of world models in general, and also active inference, and I say active inference in big quotes because it's not Karl Friston's active inference; you know, maybe adaptive inference or something like that. But we want to build agents that can continue to learn, that can update their parameters, and, most importantly, can construct path-dependent understanding, because that's completely different to just understanding what the thing is. How you got there is very important, and this architecture potentially allows these agents, using this algorithm, to explore trajectories in spaces, find the best trajectories, and actually construct an understanding which carves the world up at the joints.
>> Yeah, that's a really neat perspective. I haven't actually thought about it like that, but yes, I think that particular stance becomes really interesting when you think about ambiguous problems, because carving the world up in one way is as performant as carving it up in another way.
>> Yeah. You know, perhaps hallucination in language models is carving the world up in some fine way, but it's just not performant on our measure; we label it hallucination and say that's not true. But in some other trace down the path of wanting to carve the world up through autoregressive generation of tokens, you end up with a different carve-up of that world. Being able to train a model that is implicitly aware of the fact that it is carving up the world in a different way, and can explore those descents down the carve-up, is something that we're after. I think it's quite an exciting approach to take the stance of: let's break this problem up into small solvable parts, learn to do it like that, and ask how we can do this in a natural way without too many hacks.
Yeah, it's something I've been thinking about, because Chollet, as much as I love his measure-of-intelligence ideas, for him adapting to novelty is getting the right answer, and the reason why you gave that answer is very, very important. In machine learning we have this problem that we come up with a cost function that rather leads to this shortcut problem. You know, we could just build a symbolic system, we could do GOFAI, and say, okay, we need to do this principled kind of construction of knowledge that maintains semantics. Well, we're not doing that. We're doing a hybrid system. But there must be some natural way of doing reasoning where, in spite of the end objective being this cost function, because of the way that we traversed these open-ended spaces, we can actually have more confidence mechanistically that we're doing reasoning which is aligned to the world.
>> I think that's a great way of seeing this particular avenue of research, and obviously we're not the only people thinking like this and we're not the only ones trying to do this. What we have is an architecture that's amenable to it, and surprisingly so; again, it wasn't the goal. It's not the goal to do this type of research. It's not the goal to be able to break the world down into these small chunks that we can actually reason over in a way that seems natural. Instead, what we did was pay respect to the brain, pay respect to nature, and say, well, if we build these inspired things, what actually happens? What different ways of approaching a problem emerge? And then, when those different ways of approaching a problem emerge, what big philosophical and intelligence-based questions can we start to ask? That's where we're at right now. So it might feel at times, especially for me, like too many questions and too few hands to answer those questions. But the fun and exciting thing, and the encouraging thing that I can try to pass on to other younger researchers out there, is: do what you're passionate about, figure out how to build the things that you care about, and then see what that does. See what doors it opens up and how to explore deeper into those domains.
>> We were talking about this yesterday, weren't we? That you can think of language as being a kind of maze.
>> Yes. Like, what is to stop us from taking this architecture and building the next generation language model with it? I mean, that's honestly, as you know, something that I am actively trying to explore right now. And yeah, I think the maze task gets really interesting when you add ambiguity to it, when there are many ways to solve the maze. Honestly, this isn't something I've tried yet, and maybe it's something I should try next week, but you can essentially imagine an agent, or the CTM in this case, observing the maze and taking a trajectory. And surprisingly, we saw this. We have a section in our recently updated paper on arXiv, the final camera-ready version, where we added an extra supplementary section that is not in the main technical report. That section is basically, hey, we saw this cool stuff happen, and we list I think 14 different interesting things that happened while we were doing the research that obviously didn't make it into the paper, but we wanted people to know about them.
This is one of the strange things: we watched what was happening during training, and at some point, maybe halfway through the training run, we could see that the model would start going down one path in the maze and then suddenly realize, oh no, damn, I'm wrong, and would backtrack and take another path. Eventually it gets really good, and it does some sort of distributed learning here because it's got an attention mechanism with multiple heads, so it can actually figure out how to do this pretty well and refine its solution. But sometime early on in the learning it descends multiple paths, comes back and backtracks.
We have a really fascinating set of experiments that also showed, and we actually have some supplementary material online showing this, and I don't really know what this says, it's kind of a deep philosophical thing, but if you're trying to solve a maze and you don't have enough time, it turns out there's a faster algorithm to do it. This blew my mind when I saw it. If we constrain the amount of thinking time that the model has but still get it to try to solve a long maze, then instead of tracing out that maze, what it does is quickly jump ahead to approximately where it needs to be, trace backwards and fill in that path, then jump forward again, leapfrog over the top, trace that section backwards, and leapfrog again. It does this fascinating leapfrogging behavior that is based on the constraint of the system. And again, you know, this is just an observation we made. What that means in a deep sense, how it relates to giving a model time to think versus not, whether it has enough time to think, what happens, what different algorithms does the model learn when you constrain it in this way? I find that quite fascinating and an interesting thing to explore. Does it tell us something about how humans think? Does it tell us something about how we think under constrained settings versus open-ended settings? There's a number of cool questions you can ask on this front.
>> You guys are both huge fans of population methods and collective intelligence, and we can scale this thing up and we can scale it out. What would it mean to scale this thing out, not just in a kind of, what do they call it, trivial parallelization, but in terms of having some kind of weight sharing between parallel models and so on? What would that give you potentially?
>> This is a fun area of research. So one of the active things that we're trying to explore in our team is the concept of memory, long-term memory, and what that means for a system like this. An experiment that one can construct, for instance, is to put some agents in a maze and let them try to solve it, not how we did it in the paper, but in a very constrained setting where an agent can only see maybe a 5x5 region around it, and we give that agent some mechanism for saving and retrieving memories. The task, if you wish, is to solve that maze, find your way to the end, and the model needs to learn how to construct memory such that it can get back to a point it has seen before and know, I did the wrong thing last time, and go a different route. You can then extend this to parallel agents in the same maze with a shared memory structure and see what actually happens when they can all access that memory structure and have a shared global, almost cultural, memory, and solve the global task by having many agents use this memory system. And I do think that memory is going to be a very key element of what we need to do in the future for AI in general.
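[Editor's note: a toy sketch of that kind of setup, purely illustrative; the environment, class names, and memory format below are hypothetical, not Sakana code. Each agent observes a 5x5 egocentric window of a grid maze and can write notes into a single global store that every agent can read.]

```python
import numpy as np

class SharedMemory:
    """Hypothetical global memory shared by all agents (a 'cultural memory')."""
    def __init__(self):
        self.entries = {}                      # position -> note, e.g. "blocked"

    def write(self, position, note):
        self.entries[position] = note

    def read(self, position):
        return self.entries.get(position)

class MazeAgent:
    """Agent that only sees a 5x5 window around itself in a grid maze.

    maze: 2-D numpy array with 0 = free cell, 1 = wall; assumed to have a wall border.
    """
    def __init__(self, maze, start, memory):
        self.maze, self.pos, self.memory = maze, start, memory

    def observe(self):
        r, c = self.pos
        padded = np.pad(self.maze, 2, constant_values=1)   # pad outside with walls
        return padded[r:r + 5, c:c + 5]                    # 5x5 egocentric view

    def step(self, action):
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        dr, dc = moves[action]
        nr, nc = self.pos[0] + dr, self.pos[1] + dc
        if self.maze[nr, nc] == 0:
            self.pos = (nr, nc)
        else:
            # record the failure so other agents (or a later visit) can avoid it
            self.memory.write((nr, nc), "blocked")
        return self.pos
```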
>> So the subject of reasoning came up just a second ago, and I think there's a perception that we've recently made a lot of progress in reasoning, right? Because it's actually one of the main things that I think people are working on. We released a dataset recently called Sudoku-Bench, and I was actually quite happy to see it come up organically on your podcast a few weeks ago.
>> Chris Moore, right?
>> Yes. So, I wanted to tell you a little bit about this benchmark, because I've been having a bit of an issue promoting it. On the surface it doesn't sound particularly interesting, because Sudoku has a feeling of already being solved, right? So how interesting can a collection of Sudokus be for reasoning? Exactly. We're not talking about normal Sudokus. We're talking about variant Sudokus. And variant Sudokus are usually normal Sudokus, right, put the numbers one to nine in each row, column, and box, but then with literally any additional rules on top of that.
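[Editor's note: to make the "classic rules plus extra constraints" idea concrete, here is an illustrative checker for just the standard row, column, and box rules; it is a sketch, not code from Sudoku-Bench, and a variant puzzle would layer further, puzzle-specific checks on top of it.]

```python
def valid_classic_sudoku(grid):
    """Check only the standard rules: digits 1-9 exactly once per row, column, and 3x3 box.

    grid: 9x9 list of lists of ints. Variant puzzles add further checks
    (killer cages, thermometers, natural-language meta-rules, ...) on top.
    """
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6) for bc in (0, 3, 6)
    ]
    return all(group == digits for group in rows + cols + boxes)
```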
And they're all handcrafted. They all have extremely different constraints, constraints that actually require very strong natural language understanding. For example, there's one puzzle in the dataset where it tells you the constraints of the puzzle in natural language and then says, "Oh, by the way, one of the numbers in that description is wrong." Right? So you have to be able to meta-reason about the rules themselves even before you start solving the puzzle.
There are other puzzles where you have a maze overlaid on the Sudoku, and a rat has to work out a way through the maze by following a path to the cheese, but there are constraints on the path it takes, like what the numbers along it can add up to. It's difficult to really describe how varied these variant Sudokus are, and I think they're so varied that anyone who was actually able to beat our benchmark would necessarily have had to create an extremely powerful reasoning system.
Right now, the best models get around 15%, and those are only the very, very simplest and smallest Sudoku puzzles in the set. We're going to be putting out a blog post about GPT-5's performance, and it is a jump, but it's still completely unable to solve puzzles which, you know, humans can solve.
What I really like about this dataset, and what was actually the catalyst for me creating it in the first place, is that there was a quote by Andrej Karpathy saying, okay, we have all this data from the internet, but what you really want, if you wanted AGI, is not all of the text that humans have ever created; you would actually want the thought traces in their heads as they were creating that text. If you could actually learn from that, then you would get something really powerful. And I thought to myself, well, that data must exist somewhere. My first thought was maybe philosophy, like, you know, there's a type of philosophy where you just write down your thoughts as they come, just stream of consciousness.
I thought maybe that could work. But then, when I wasn't thinking about it and I was, you know, in my leisure time, I was watching a YouTube channel called Cracking the Cryptic.
>> Yes.
>> Where these two British gentlemen solve these extremely difficult Sudoku puzzles for you. Right. Sometimes their videos are four hours long, and they're professionals; this is their job. And what I realized was perfect is that they tell you, in agonizing detail, exactly what reasoning they used to solve those particular puzzles.
Right? So, with their permission, we took all of their videos, which represent thousands of hours of very high-quality human reasoning, thought traces essentially, scraped them, and made that available for imitation learning. We did try to do this internally. It turns out I did a little too good a job of creating a very difficult benchmark. So we're still trying to get that working, and we'll publish it if we have some success.
Yeah, I want to really sell the fact that this reasoning benchmark really is different, right? Not only do you get something that's super grounded, like you know exactly if it's right or wrong, so you can do RL to your heart's content, but you can't generalize very easily.
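[Editor's note: that groundedness is easy to picture. Assuming each puzzle has a unique solution grid, a checker like the one sketched above turns directly into a trivially verifiable reward signal; again an illustrative sketch, not how Sudoku-Bench itself scores models.]

```python
def sudoku_reward(predicted_grid, solution_grid):
    """Illustrative grounded reward, assuming the puzzle has a unique solution.

    Returns 1.0 on an exact match with the solution; otherwise an optional
    shaped partial credit (fraction of cells that agree).
    """
    pairs = [(p, s) for prow, srow in zip(predicted_grid, solution_grid)
                    for p, s in zip(prow, srow)]
    correct = sum(p == s for p, s in pairs)
    return 1.0 if correct == len(pairs) else correct / len(pairs)
```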
Each puzzle is deliberately designed by hand to have a new and unique twist on the rules, called a break-in, that you have to understand. And right now, despite all the progress we've made, the current AI models can't take that leap. They can't find these break-ins, right? They'll fall back to, okay, I'll try... no, I'll try five, I'll try six, I'll try seven. The reasoning becomes really boring and nothing like what you see in the transcripts that we've open-sourced from this YouTube channel. So I just want to put the challenge out there that this is a really difficult benchmark, and I think progress on this benchmark will really mean progress in AI generally.
>> Could you reflect a bit, after watching this Cracking the Cryptic YouTube channel, on how diverse the patterns were? Because Chris was saying to me, oh, you know, these guys go on Discord servers, they get these creative, crazy ideas, and I'm obsessed. Maybe I'm just being idealistic, but I love this idea of there being a deductive closure of knowledge, right? That there's this big tree of reasoning and we're all in possession of different parts of the tree to different depths. So the smarter and more knowledgeable you are, the deeper down the tree you go. But in this idealized form, there is one tree, and all knowledge kind of originates or emanates from these abstract principles. We could in principle build reasoning engines that just reason from first principles, and it might be computationally irreducible, so you have to perform all of the steps. And it feels like, because we're not in possession of the full tree, what we need to do is kind of fish around. We fish around to find Lego blocks: oh, that's a good Lego block, I can apply that to this problem. And maybe that's just what we need to do in AI for the time being, acquire as much of the tree as possible. But could we just do it all the way down?
>> Yeah, fascinating question.
That tree is probably massive, right?
>> And as a human is solving these puzzles, they're definitely learning in real time and discovering new parts of this tree.
And it's sort of a meta-task, right? Because it's not just reasoning, you're reasoning about the reasoning. And I don't think we have that in AI right now. Because if you watch the videos, they'll say something like, "Okay, this looks like a parask, or this is a set-theoretic problem, or, you know, maybe I should get my path tool out and trace this around." And of course the professionals already have this massive collection of reasoning Lego blocks, as you say, in their heads, so they'll recognize, okay, that type of rule usually needs this kind of Lego block. It's actually fascinating to watch how good they are at just intuitively knowing where to look, where someone like me, who hasn't solved as many, needs to spend a lot of time looking around, like, okay, maybe I should try this, or maybe I'll try this one. But even they're not perfect. You can watch them take a certain kind of reasoning and start building it up, okay, maybe we should solve it like this, and then realize that doesn't disambiguate it enough, and backtrack and go down another path.
Again, something that we do not see current AI doing when it's trying to solve this benchmark. The tree is very big, and I guess the phylogenetic distance between many of these motifs in the tree is just so large that it's so difficult to jump between them. And I think that's why, as a collective intelligence, we work so well together, because we actually find ways to jump to different parts of the tree.
>> Right? And I think that's probably why the current state of the RL algorithms that we're trying to apply to this just isn't working. Because in order to learn how to get these breakthroughs, to understand the sort of nuanced reasoning needed to solve these puzzles, you have to sample them. And it's such a rare space, such a specific kind of reasoning that's required to get to the specific breakthrough, that this kind of technique doesn't work, right? There's definitely a feeling in the community like, okay, this is how you just solve things now: we have RL, we can get these language models to do what we want. It doesn't work for this dataset.
>> Guys, it's been an absolute honor having you on the show. Just before we go, are you hiring? Because we've got a great audience of ML engineers and scientists, and I think working for Sakana would be the dream job.
>> That's very kind of you. Yes, we are definitely hiring and as I said earlier in this interview, I honestly want to give people as much research
freedom as possible. I'm willing to make that bet, right? I think things that are very interesting will come out of this.
And I think we've already seen plenty of interesting things coming out of this.
So if you want to work on what you think is interesting and important, come to Japan.
>> And Japan just happens to be the most civilized culture in the world.
>> All right.
>> It might be the opportunity of a lifetime, folks. So yeah, get in touch. Guys, seriously, thank you so much. It's been an honor having you both on the show.
>> Thank you very much.
>> Thank you so much. It's been great.