The Mathematical Foundations of Intelligence [Professor Yi Ma]
By Machine Learning Street Talk
Summary
Topics Covered
- Intelligence Formalizes as Parsimony and Self-Consistency
- LLMs Memorize Language, Don't Understand It
- Abstraction Marks Phase Transition Beyond Compression
- Current AI Replicates Empirical Memory Formation
- Rate Reduction Enables Principled World Models
Full Transcript
In the past 10 years, I think the question of intelligence, or artificial intelligence, has captured people's imagination. I'm one of them, but it took me about 10 years to really understand: can we actually make understanding intelligence a truly scientific or mathematical problem, and formalize it? You will probably get some of my opinions, and also the facts about it, and it will probably change your view of what intelligence is, which has also been a very uncertain process for me.
How do we clarify some of the common misunderstandings about intelligence? Through this journey, maybe we will gain an entirely new view of what we have really done in the past 10 years: what the practice of artificial intelligence, the mechanisms we have implemented behind all the deep networks and large models, truly are, and hence understand their limitations, and also what it takes to truly build a system that has intelligent behaviors or capabilities. I think we have reached the point where we will be able to address what is next for understanding even more advanced forms of intelligence: what is the difference between compression and abstraction, between memorization and understanding? I think for the future those are the big open problems for all of us to study.
MLST is supported by Cyber Fund. Link in the description.
>> The idea of having to traffic in squishy people in order to make our systems go is not immediately appealing.
Let's put it that way.
>> This episode is sponsored by Prolific.
>> Let's get a few quality examples in. Let's get the right humans in, to get the right quality of human feedback in. So we're trying to make human data, or human feedback: we treat it as an infrastructure problem. We try to make it accessible. We make it cheaper. We effectively democratize access to this data.
>> Professor Ma, it's amazing to have you on MLST. Welcome.
>> Thank you for having me.
>> So, normally I ask guests to introduce themselves, but given your stature in the field, I think it's best that I give you an introduction. Yi Ma is a world-leading expert in deep learning and artificial intelligence. He's the inaugural director of the School of Computing and Data Science and director of the Institute of Data Science at the University of Hong Kong. He's also a visiting professor at UC Berkeley, where he previously served as a full professor in electrical engineering and computer science. He's an IEEE Fellow, ACM Fellow, and SIAM Fellow whose pioneering work on sparse representation and low-rank structures has fundamentally shaped modern computer vision and machine learning. His recently published book, Learning Deep Representations of Data Distributions, proposes a mathematical theory of intelligence built on two principles: parsimony and self-consistency.
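The parsimony principle in this framework is often made concrete through the coding-rate measures from Ma and collaborators' rate-reduction (MCR²) line of work. As a rough illustration, here is a sketch of the standard coding-rate function (variable names are mine, not code from the book): features confined to a low-dimensional subspace cost fewer bits to encode than features spread over the whole space.

```python
# Sketch of the coding-rate function R(Z) = 1/2 logdet(I + d/(n*eps^2) Z Z^T)
# used in the rate-reduction (MCR^2) framework. Illustrative only.
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Approximate cost (in nats) of coding the n columns of the
    d x n feature matrix Z up to distortion eps."""
    d, n = Z.shape
    gram = (d / (n * eps**2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(np.eye(d) + gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)
# Features spread over all 8 dimensions are expensive to encode...
spread = rng.standard_normal((8, 100))
# ...while features lying on a 2-D subspace are much cheaper.
basis = rng.standard_normal((8, 2))
compact = basis @ rng.standard_normal((2, 100))

assert coding_rate(spread) > coding_rate(compact) > 0.0
```

Parsimony then corresponds to driving features toward structured (low coding-rate) configurations, while the rate-reduction objective contrasts the rate of the whole feature set against the rates of its parts.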
This framework has led to white-box transformers known as CRATE architectures, where every component can be derived from first principles rather than empirical guesswork. So, Professor Ma, tell me about your book.
>> You know, about seven or eight years ago, deep learning had pretty much changed the practice of machine learning and artificial intelligence over the past decade. Around then I had a chance to get back to Berkeley, which gave me a chance to look into these topics more deeply and try to understand them through a more principled approach. Hence the book is a kind of summary of the progress made over the past eight or more years by myself, my group, as well as many colleagues, trying to understand the principles behind deep networks and explain them from first principles. In that journey we also seem to have embarked on something a little beyond that: we found something probably more general behind it, which is intelligence, at least intelligence at a certain level. Hence, when I got back to join Hong Kong University about two years ago, I had a chance to design, or redesign, some of the curriculum to reflect the rapid progress in our field. So my students and colleagues decided that maybe it was time to systematically organize this body of knowledge into a textbook, as well as a new course, which I'm teaching this semester and which will likely be offered at Berkeley next semester as well. So this is actually probably the first time we have tried to provide a more principled approach to explaining deep networks, as well as some principles of intelligence.
>> And these principles are parsimony and self-consistency. So it's an ambitious idea, that these principles could explain natural and artificial intelligence. What do you mean by that?
>> Intelligence, artificial or natural, or whatever adjective you add to intelligence: we have to be very specific. It's a very loaded word, right? I mean, even intelligence itself may have different levels, different stages. So it's high time we clarified that concept scientifically or mathematically, so that we'll be able to talk about and study intelligence, and the mechanisms behind it, at each level. There are some more unified principles behind even the different stages of intelligence; there is something in common, and there are also things that differ, so it's high time we did that. One level of intelligence is that which is common to animals and humans; we humans are animals too. That level of intelligence is what we think is very common to all life: how memory works, how we learn knowledge about the external world, memorize it as part of our memory, and use it to predict, to react to the world, to help us make decisions, and to make better decisions for survival, and so on. That's very, very common, and this is the level of intelligence we're very much talking about in the book. Hence, for this level of intelligence: how does our memory work? Today we also have a fancy word for memory; we call it a world model. It's how we develop such a memory, such a world model, how the model evolves, and how we use it. That is the level we talk about.
So we actually believe that for this level of intelligence, for how our memory is formed and how it works, precisely these two principles are incredibly important. And we believe they're necessary. Memory, or knowledge, is precisely trying to discover what's predictable about the world. Hence all such information intrinsically has very low degrees of freedom; we call these low-dimensional structures. And the way to pursue such knowledge is precisely by trying to find the simplest representation of the data. Compression, denoising, dimension reduction: these are all just different words for pursuing such knowledge, such structure. That's what is captured by the word parsimony: making things as simple as possible, but not any simpler. This is the sentence Einstein used to describe science, and it is also precisely what intelligence, at least at this level, is doing. The second part of the sentence, "not any simpler," precisely speaks to consistency: making sure your memory is actually consistent, able to recreate and simulate the world just right, and not any simpler. If you're simpler, you may lose part of the predictivity, the ability to predict well. So those two actually coexist, we believe, and those two principles, parsimony and self-consistency, are actually the two characteristics of how our memory works.
>> So we want to have understanding which carves the world up at the joints, which represents the important invariances in the world, and the thesis, I think, is that compression might be necessary for understanding. My possible concern with that is that what we are doing with machine learning is representing extant examples of a long phylogenetic tree of evolution.
>> Mhm.
>> So to what extent does knowing their representation now help us? Do we also need to know how they evolved and where they might go in the future?
>> The process of acquiring knowledge, of gaining information about the outside world, is compression: finding what is compressible, what has order, what phenomena have order, what has low-dimensional structures that allow us to rule out variabilities and predict the world, predict tomorrow, or predict the world better in that sense. That ability, we believe, is really what intelligence is all about, at least the common intelligence we're talking about; we can talk about higher-level intelligence later. And if you look at the history of life, how life developed, we have actually come to believe the following. The laws that govern the physical world, we call physics. But what is the mechanism that governs the evolution of life? I think it's intelligence. Even in the process you mentioned, through evolution, life evolves, and precisely, organisms learn more and more knowledge about the world and encode it in DNA to pass on to the next generation. That is compression; that's a process of compressing what life has learned about the world into our DNA. But the mechanism to update it is very brute force: random mutation and natural selection. Yes, it does evolve, it does advance, but at a huge cost of resources and time, and it is also very unpredictable. If you're acute, you have probably observed that there is some similarity with how current big models evolve: many, many groups try, without principles, by trial and error, empirically, and the lucky ones survive, get adopted everywhere, become very popular, and dominate the practice. So in a sense you can make the analogy. When students ask me at which stage our artificial intelligence is today, there is already an analogy in nature: we are very much at the early stage of life forms. So that is a compression process, a process that also gains knowledge about the world. But of course, later on, individual animals developed brains, developed neural systems, developed senses, including vision and touch and so on. So we started to use a very different mechanism to learn, to compress our observations, to learn knowledge, and to build memories of the world, and even individuals started to have that ability, rather than just inheriting knowledge through their DNA. So those are different stages, and that part of the knowledge is no longer encoded in our genetics, in our genes, but rather in our brains. And that's actually the level of intelligence we talk about most of the time these days, which is common to animals and common to humans: the knowledge, or the intelligence, of brain functions.
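Ma's identification of compression, denoising, and dimension reduction as routes to low-dimensional structure can be illustrated with a toy example: data that looks 50-dimensional but actually lives near a 2-D subspace is almost fully captured by two principal components. (An illustrative sketch only; the names and numbers are mine, not from the book.)

```python
# Toy illustration of "compression = finding low-dimensional structure":
# data generated near a 2-D plane inside a 50-D ambient space can be
# represented almost losslessly by its top principal components.
import numpy as np

rng = np.random.default_rng(0)
n, ambient, intrinsic = 500, 50, 2

# The structured (predictable) part lives on a 2-D subspace...
basis = rng.standard_normal((ambient, intrinsic))
X = rng.standard_normal((n, intrinsic)) @ basis.T
# ...plus a little unpredictable noise.
X = X + 0.01 * rng.standard_normal((n, ambient))

# PCA via SVD: the top `intrinsic` singular values carry nearly all
# the variance, so 2 of 50 coordinates suffice to describe the data.
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
explained = (s[:intrinsic] ** 2).sum() / (s ** 2).sum()
print(f"variance captured by 2 of 50 dimensions: {explained:.4f}")
```

Denoising here is the same operation seen from the other side: discarding the 48 near-zero directions removes exactly the unpredictable part.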
>> Yeah, I mean, I think we would definitely agree with the statement that intelligence as a system produces artifacts. So Chollet's example is a road-building network: it produces roads, and the system has adaptivity because it can create new roads where there weren't any before. And then there's the question of, well, there are many ways to compress a thing. Some ways of compression represent the world at a deep, abstract level, and some don't. So we might argue that LLMs today, even though they do compress the data, only compress it in a superficially semantic way. And then there's this notion of, well, maybe we agree that intelligence is about the synthesis of new knowledge. So it's the acquisition of new knowledge, but we can only do that if the knowledge we already have represents the world at a deep, abstract level. So rather than it being random mutations, as in evolution, it's very, very structured, because the processes are physically instantiated, which means rather than just doing something completely random, it's guided by the process which created them.
>> There are a lot of confusions about what knowledge is and also what the process to gain knowledge is. For example, many people describe what the language models are doing to language. Which is, by the way, very different. Don't forget: our language is a result of compression. Language is precisely the code we learned to represent the knowledge we gained through our physical senses about the external world, through billions of years of evolution, through however our brain evolved. It's a result of that; language actually represents knowledge. We use natural language to encode our knowledge as something common to all people. Now we're using another model, another compression process, to memorize it. In a sense, you can argue that what those large language models are doing is treating that text as raw signals, and through further compression, identifying its statistical structures, its internal structures. What that is actually doing is not very clear; maybe it just helps us memorize the text as it is and regenerate it. It's not going through a process like the one through which our natural language developed, which was very long, and our language is actually grounded in our physical senses, in our world models, our memory. Our language precisely tries to describe that; it's an abstraction of the world model we have in our brain. It's the small fraction of the knowledge worth sharing with each other, far smaller than the actual model we have. We have many things in our senses and our memory that there is no way we can express in words. Don't forget: a small fraction of our brain processes natural language, but the majority processes visual and motor-sensory data. So that says something about the role of language. Now people are very much confusing the process: the natural language models we use to reprocess natural language, the knowledge as we know it, the human knowledge common to human society, treat this fraction of knowledge very much the way we would treat raw data. Hence you can see that even the mechanisms, transformers or whatever architectures, are very much the same ones we commonly use to reprocess, for example, videos and visual data: we are treating language as if it were visual data. We tend to confuse the mechanism by which we memorize or extract knowledge from our senses with what we should be doing with knowledge itself; we confuse that process with understanding, and presume the large language model is actually understanding the natural text. So there is a very, very fundamental difference. The same process of compressing raw data to extract statistical correlations, to find low-dimensional structures, works for the senses, to build knowledge. But we're actually applying it to our knowledge, to our natural languages, and we pretend that is understanding. That's probably not.
>> You cited Max Bennett in your talk.
By the way, folks at home, you should watch Professor Ma's talk on intelligence and some of the ideas in his book; I'll put a link in the video description. But you cited Max Bennett, and I read his book, A Brief History of Intelligence, and he also had this really interesting idea that language is basically a set of pointers: we're actually sharing simulations and second-order simulations, and in a sense those are just pointers to the simulations, and that's where a lot of the semantic content is. But you also spoke about levels of intelligence. It's a wonderful idea, isn't it? That we have this phylogenetic information accumulation, which is very slow, you know, every single physical generation; and then we have this ontogenetic accumulation in our lifetimes; and then we have social accumulation, so we have a big hard drive in the sky where we kind of store these pointers; and then we also do science, which is very abstract, where we just hypothesize about things. And what's very interesting about science is, you know, there's always been this division between empiricism and rationalism. So empiricism says if the idea is in the senses, it's in the mind; and this other idea is that sometimes we just conjure things which are not in the data. So that science thing is very interesting.
>> Exactly. So you just mentioned the four stages, and we elaborate on this a little in our book: how those four stages actually have something in common. What is common is precisely trying to extract structures from the data and record them, record what's predictable; through compression, through denoising, through dimensionality reduction, capture the correlations in the signals, and use that for predictivity, for predictions, and so on. That's in common. Hence for all four stages of intelligence, the principle of parsimony and the principle of self-consistency are at work, although through different mechanisms: different code books, different mechanisms for updating, for optimizing the information, or even for acquiring the information. That is actually very important to know. Now, the main thing you mention, and it's also a point of confusion, is precisely this: for animals, or humans, even human society at the stage before science appeared, you can see that almost all the knowledge we gained was through an empirical approach, kind of passive. We observe, we try, we make some errors and learn from our mistakes, and then we record that. For example, Chinese medicine and Indian medicine worked for many, many years, right? It's very similar, it even became similar to how DNA evolves: we search in a sort of less organized way, somewhat by chance, some by accident, but we accumulate knowledge that becomes very, very useful. We understood how the weather changes and how the planets move in a very empirical way, for many, many years. And we share that, we write it down in language, in text, and pass that knowledge down to the next generation, similar to what DNA does with the knowledge learned through a different process.
Now, there's a huge distinction. I place it around 3,000 years ago; we don't know what happened, to be honest. Maybe the whole process of acquiring empirical knowledge is just compression, but suddenly we became able, somehow, to do abstraction. We started to develop knowledge that is far beyond empirical observation. Think about it: for example, the notion of numbers. Natural numbers: we count up to 500, but suddenly we realize this process goes to infinity. Kids start to do that; in middle school most people start to get that point. And there's an amazing thing: long ago, Euclid formulated geometry, and one of the assumptions, I don't know if people know this, goes far beyond the empirical: it says two parallel lines never intersect. The word "never" actually implies infinity; it is something you would never observe empirically. How did we come up with that idea? What happened to our brains? We started to jump from compressing empirical knowledge and identifying correlations to something we actually formalize, something abstract. Hence in my book, towards the last chapter, I ask: is there a difference? Is abstraction just compression? Probably not. Abstraction is definitely related to compression, but there seems to be something different, something more. You know, Karl Popper, the famous philosopher of science, said that science is the art of over-simplification. The ability to abstract: we are able to hypothesize things, and there is a distinction between hallucinating and hypothesizing, I believe. But what is it? We know that through compression we can memorize, we can learn the data distribution, we can find a very good representation, we can even use that representation to regenerate data with the same distribution; we call that memorization. But is there a difference between memorizing a data distribution and understanding it? We can emulate how we conduct logical deduction, right? Just as now we can use supervised fine-tuning, or chain of thought, or reinforcement learning, to force the large language model to emulate, to memorize, how we solve logical problems, how we solve mathematical problems. But is that solution based upon understanding logic, understanding the necessity of logic, mastering the mechanism of logic and applying it, or is it just emulating the process? We don't know. So we have a lot of questions now. What is the difference between compression and abstraction? What is the difference between memorizing and understanding? It's kind of similar to when Turing was faced with the question: what is computable and what is not? We know there is a difference, but how do you crystallize that? Or, as we also now ask, is P equal to NP? We know there might be a distinction. Can we formalize that question? If we believe there's no difference, prove it; or if there is a difference, how can we qualitatively or quantitatively see what lies beyond compression, what will take us to the level of abstraction? Hence I believe this is the phase transition, from developing empirical knowledge to scientific knowledge. And what is the distinction? To me, sometimes I call this last stage of intelligence the true artificial intelligence. If people ever bother to read the proposal laid out by the folks in 1956, you'll find that that's actually the level of intelligence they truly meant to work on. But yet, from all we understand about the practice of the past decade, we are very much reproducing the kind of mechanism that operates at the level of memory, of how empirical memory forms. In fact, I believe even the large language models are precisely memorizing the large volume of text through the same mechanism by which we form empirical knowledge; the knowledge is encoded in natural language. Hence, whether that is equivalent to understanding: that's a big question mark.
>> It is tantalizing, isn't it? Thinking about when we come up with these new theories that don't seem to come from the data: where do they come from? You could be a Platonist and say they are just a gift from God, or a nativist and say somehow they're in our brain. Or maybe we could subscribe to the idea that there is a kind of deductive tree: the tree of all possible conceivable knowledge, which represents our cognitive horizon. And if only we could build systems that could acquire that tree very abstractly, and if we could design a compositional system that could creatively explore that tree, then somewhere in that tree we would be able to discover these abstract theorems. But then there's the question of, well, why don't large language models do this now? So we see the ARC challenge, for example, and what we see is that models are very, very bad at abstract compositional reasoning, where you need to take abstract things and combine them together to adapt to novelty. They don't do that very well. And in a sense, one kind of optimistic view is that these models are learning lots of factored representations, and it is conceivable that something like a large language model could do that. But the other school of thought is that it's just not possible, because they don't have abstract enough understanding. What do you think?
>> I believe, at least from all our current understanding of what the large language model, at least the current architecture, is doing: it is precisely using the same mechanism by which we extract empirical knowledge from data, the correlations, the low-dimensional structure within the data, to memorize natural language, which to some extent represents the knowledge we have. Hence the language model is precisely using the same process by which we acquire our empirical knowledge to process the large volume of natural language. From the mechanism side, I don't think it actually needs understanding. But in order to make that statement conclusive, you need to know what extra thing leads to your question: what do we mean by understanding, and what might really have that deductive structure? So this is a big question, to be honest. I think that's a question the scientific community truly needs to answer, or at least to discuss, now. And what do we mean by that? You know, since modern science there have always been two schools of process that allow us to propel science forward. One is inductive: we do experiments, we observe. The other is deductive.
Once we have accumulated enough inductive observations, experiments, empirical observations, we start to make hypotheses, assumptions, axioms, so that we can then go through a very rigorous deductive process to derive the implications of those assumptions and reach conclusions that are testable, that are actually measurable. Then, through verification by experiment, we can measure, and we can either verify or falsify the original assumptions. That's a very powerful process, but it relies precisely on having a very rigorous logical deductive system; without that, we cannot overthrow the original assumptions. So that is a fundamentally deep, abstract process. How do we have that in our brains? We can argue about whether or not animals have it, or at what point we reached the critical moment when we became able to develop that ability. Our brains evolved; maybe we went through some phase transition in our brain structure that allows us to identify those structures, Platonic or whatever you call them. And not only does the original scientist discover the logical, causal abstraction, but other people also gain the ability to understand it. That is actually quite amazing if you think about it. Even though human beings communicate through language, somehow we all, at a certain stage, develop the ability to understand, to learn mathematics, even mathematics discovered somewhere else; I can reach a similar level of understanding as the original discoverer, and be similarly convinced that the proof is rigorous. The logical deduction is something necessary, not like natural language, which has ambiguity in it. So this is something up for debate: is it really that there is a God out there, that there is some invariant knowledge, truly some ground truth out there? We don't know. But definitely, the ability that allows us to reach that level: I think that is truly the next thing we need to understand. What are the mechanisms, implementable, reproducible mechanisms, that would allow us to recreate, to have a system able to gain that kind of ability? I think that will be the next stage of intelligence. So we would be able to have an artificial system reach the stage of an educated, enlightened human, beyond humans as just animals.
>> And do you believe in practice that's possible?
>> I truly believe that it is a part of our brain, or a function, that can be discovered, understood, and even reproduced. But for now, what exactly that mechanism is, I don't think we have much of a clue. We know the artifacts, just like your example: we know the roads that the road-building company or network builds. Even today, we can learn logic, we can learn mathematics, but what is the mechanism that allows us to create a new mathematical theorem, create a new scientific theory, conduct logical deduction, and understand it? What is that mechanism? We don't understand; we only know it after the fact. The logic makes sense, but why it makes sense, what the mechanism in our brain is that we share, such that this whole deductive process makes sense to all of us, is still quite unclear.
>> The reason I asked the question is that a lot of cognitive science folks say that we understand because we're causally embedded, and what they mean by that is that we evolved with the world, which means all of the representations in our brain co-evolved with other things in the world, and it's deeply rooted. So the implication is that intelligence is quite specialized. You could say intelligence is just the efficient search of the space of Turing machine algorithms, and yeah, that would be correct, but it's describing the what, not the how. It's almost trivial to say, and it doesn't describe how we would implement intelligence. Whereas if intelligence is the acquisition of knowledge, it must be domain-specific, right? Because the road-building company that we spoke about can't build any type of road. They are restricted in terms of the materials they have access to, and presumably different types of knowledge are quite different from each other. So an intelligent process might be able to acquire knowledge over here but not over there. How specialized do you think intelligence is?
>> Remember, the road-building company is just an analogy, a special case. That exactly echoes what we mentioned earlier: intelligence has different stages, or even different forms, but there is always something in common. Even at the early stages, how DNA evolves, how our memory evolves, how human society's logic evolves, or how our scientific knowledge evolves, those are precisely different kinds of building: some are building buildings, some are building roads. But the mechanisms are common; they use similar principles, for example the concept of compression. The mechanism that is common behind all of this is to discover what is structured and what is not random. That is what is common: under the principle of parsimony, or compression, or dimension reduction, or denoising, to discover those structures. Although the operations might be different.
The domains, the knowledge, the distributions they are applied to might be different. Some are discrete, some are continuous. Some have higher intrinsic dimension, some lower; some are simpler. Some can be formulated as mathematical or physical equations; some cannot. But they still can be learned, can be compressed, can be memorized through other mediums: in DNA, in our neural networks, rather than in differential equations. So there are things that are very common: the mechanisms are common, even the principles are common, but the realization, the physical realization, could be different, the optimization mechanism could be different, and the codebook we learn could be different. This is something we need to understand. Once you understand that, it can help us understand a lot of things around us: what is common behind all intelligent behaviors, and also how each stage or each form may have its own domain-specific characteristics.
>> I know you're a big fan of cybernetics, which came from the 40s with Norbert Wiener, and it described this cybernetic action loop: an agent which senses and acts in a closed feedback loop. Based on what you just said, maybe the mechanism of intelligence is the same, but certainly the action space is different. So different embodied agents in one situation would be able to do A, B, and C, and over here something different. Could it be at least specialized in the action domain?
>> Those pioneers in the 1940s were interested in intelligence, but the level of intelligence they were interested in was mostly the animal level and the common human level. It's about how our brain works. It's not about DNA, although they made some analogies; they very much focused on how our brain acquires memory and builds a world model quickly through perception and action, interacting with the world to predict, make mistakes, and learn from that process. I think that's the stage they studied; they may not have quite gotten to the advanced stage we're talking about with science. They cared mostly about how memory works at the animal level, in order to build autonomous systems, autonomous machines that emulate our abilities at that level. Hence they called it, we call it, the cybernetics program. By the way, two things. Of course, the word cybernetics was a little bit abused later; just like artificial intelligence, it became a bad word for a certain period of time. And also, many people understand cybernetics in a very narrow-minded way; they think it's just about control. Actually it's not. If you read the book by Norbert Wiener, it characterizes at least the necessary characteristics a system at the animal level should have to be intelligent: how to record information, for which he needed information theory; how we correct errors, which is feedback control; how we improve our decision-making, for which he mentioned game theory, and so on and so forth. He even discussed the necessity of addressing nonlinearity, which in his view would explain why our brain has waves. He was definitely interested in what the necessary characteristics are for a system to be intelligent, and although he may have fallen short of how they should be put together, he definitely captured some of the essential characteristics an autonomous intelligent system should have, which, surprisingly, got forgotten in our past decade of practice building artificial intelligence systems. And don't forget, those are necessary characteristics that at least those pioneers were convinced an intelligent system should have. So this is something I think we should probably learn a lesson about from our history.
>> Can you sketch out the journey from information theory to your maximal coding rate reduction framework?
>> This is actually a very interesting question. Honestly, I'm not an information theorist; I was trained as a control theorist. But early on, when I was a graduate student, I also did communication. I took a lot of communication, random process, and information theory courses, although I didn't end up working in that area. So for many years this was something I never really practiced, until a few years ago, when I was studying what deep networks are doing and came to ask what intelligence is all about. I realized that maybe the common mathematical problem behind learning, at least learning knowledge at the level of memory, is pursuing a low-dimensional structure, or low-dimensional distribution, of high-dimensional data.
Okay. So once that became clear to me, it became a very soul-searching process. Now, if low dimensionality is the only prior, the only so-called, quote-unquote, inductive bias or assumption we can make, can we deduce everything from there? No. But if this is the only thing we can use, then what's special about it? The dimension is very low; hence the volume of the data should be very low. Then there actually come a lot of technical challenges. How do you differentiate? If I have a set of data and there are two models, both low-dimensional, that can equally support the data, equally explain the data, then which one do I choose? The challenge here is that if two models are both low-dimensional, then their volumes are all zero. I have an example in my book.
It troubled me a lot. I have eight dots on a line. In one case, the eight dots are evenly distributed. In another, four dots are clustered. And so on. The interpretation is very ambiguous if you think about it. What's wrong with treating them all as just eight dots, each occurring once, with probability 1/8? Nothing wrong with it. Or when should I say they all lie on a straight line? But a line already has one dimension, one degree higher than dots. Then also, what's wrong with saying those are eight dots, or a line, or a plane? As sets they are all degenerate, zero-volume.
So hence the question is: how do we measure the volume of the data? If you want to compress, you have to have a more generalized notion of volume to measure the space spanned by the data. That forced me to come across the concept of entropy. But entropy is also limited, because it precisely does not differentiate those kinds of interpretations. If you think about it, for zero-dimensional or one-dimensional distributions, the differential entropy is negative infinity: you are comparing one infinity with another infinity, or zero with another zero. So hence we came across lossy coding. Just as Shannon, after developing information theory, pointed out: when we actually do coding, we do rate-distortion coding, not lossless coding. That became a sort of magical source that gives us a much more general measure of volume for data in arbitrary spaces, and the measure can actually differentiate one degenerate low-dimensional model against another. That becomes a measure we can use to evaluate the data, to differentiate different models. Now, if we compress, those coding rates allow us to pursue those distributions in a very high-dimensional space. In fact, the very popular diffusion-denoising processes are doing precisely this: the denoising process is precisely reducing the entropy, so that we pursue a representation that is lower-dimensional, lower-entropy, and in the end converges to the distribution of the data. This is the first stage.
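This lossy coding rate can be written down concretely. A minimal sketch, assuming the form used in the maximal coding rate reduction line of work, R(Z, ε) = ½ log det(I + d/(mε²) ZZᵀ), applied to the "eight dots" ambiguity above (the specific point sets are my own illustrative choices):

```python
import numpy as np

def coding_rate(Z, eps=0.1):
    """Lossy coding rate R(Z, eps) = 1/2 * logdet(I + d/(m*eps^2) * Z Z^T)
    for a d x m matrix Z of m points in R^d: roughly, the bits needed to
    encode the points up to distortion eps."""
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (m * eps**2)) * (Z @ Z.T))[1]

# Two sets of "eight dots" in the plane (illustrative): one lies on a line,
# the other is spread out in 2-D. As discrete distributions both are just
# eight points with probability 1/8, so counting entropy cannot tell them
# apart; the lossy coding rate can.
x = np.linspace(-1.0, 1.0, 8)
line = np.vstack([x, np.zeros(8)])          # eight dots on a 1-D line
scatter = np.vstack([x, np.cos(7.0 * x)])   # eight dots genuinely 2-D

r_line = coding_rate(line)
r_scatter = coding_rate(scatter)
print(r_line, r_scatter)   # the 1-D set costs fewer bits at distortion eps
```

Unlike discrete entropy, which sees both sets as eight equally likely symbols, this volume measure stays finite on degenerate sets and assigns fewer bits to the set that hugs a one-dimensional line.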
Now, for memory, it's not just about finding where the distribution is; it's also about organizing it. A lot of people say, oh, learning is about compression, so why don't we go all the way to Kolmogorov complexity? No, you don't want to do that. First of all, Kolmogorov complexity is not computable. Second, we all know that if you really managed to compress your data down to its Kolmogorov complexity, the code would be random: the program that specifies the data would itself look random. We don't memorize a bunch of random numbers in our brain, right? Hence memory in our cortex is highly structured: different types of objects are very well organized in the IT cortex, and our spatial understanding is very well organized in the hippocampus. Highly structured, because the structuredness allows access: we want to access that knowledge repeatedly, we want to use it very efficiently, and we use it under very, very different conditions.
Hence I said our brains are very much doing Bayesian inference. Once we learn that distribution, we organize it: we transform the distribution into a very structured and organized form. The maximal rate reduction precisely reflects that necessity. You do not just reduce the coding rate to find where the distribution is; you also want to transform the data into a new representation such that the rate reduction is maximized. Hence the representation subsequently becomes structured and organized, which then facilitates efficient access: you can access memory under all types of conditions, which allows all types of conditional prediction, generation, and estimation. That's the essence. Reduce the coding rate, then maximize that reduction of coding rate: these reflect two related processes for building a good memory.
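The two processes can be sketched together. Assuming the standard maximal coding rate reduction objective, ΔR = R(Z, ε) − Σⱼ (mⱼ/m) R(Zⱼ, ε), the volume of the whole minus the weighted volumes of the parts, a structured representation with classes on orthogonal directions scores higher than an unorganized one (the toy data below is purely illustrative):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    # Lossy coding rate: 1/2 * logdet(I + d/(m*eps^2) * Z Z^T).
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (m * eps**2)) * (Z @ Z.T))[1]

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R(whole) - sum_j (m_j / m) * R(part j): the volume of the
    whole minus the weighted volumes of the parts."""
    m = Z.shape[1]
    whole = coding_rate(Z, eps)
    parts = sum((np.sum(labels == j) / m) * coding_rate(Z[:, labels == j], eps)
                for j in np.unique(labels))
    return whole - parts

# Illustrative toy data: the same 1-D coordinates, represented two ways.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
labels = np.array([0] * 50 + [1] * 50)
# Organized: class 0 on the x-axis, class 1 on the y-axis (orthogonal parts).
orth = np.vstack([np.concatenate([a, np.zeros(50)]),
                  np.concatenate([np.zeros(50), a])])
# Unorganized: both classes collapsed onto the same axis.
mixed = np.vstack([np.concatenate([a, a]), np.zeros(100)])

print(rate_reduction(orth, labels), rate_reduction(mixed, labels))
```

ΔR is zero when the two classes are collapsed onto the same line and strictly positive when they occupy orthogonal subspaces, which is why maximizing it pushes the representation toward the organized, easily accessed form described above.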
>> First of all, the manifold hypothesis comes to mind, which is this idea that all natural data falls on some low-dimensional structure with a low intrinsic dimension. The other thing that springs to mind is that I'm a fan of geometric deep learning, which is this idea that we should imbue the system with inductive priors that represent symmetries and geometric structures in the world. And I think that principle is deeply embedded in this idea.
>> Exactly. If you look at my whole life, I have written four books. My early interest was studying computer vision. The first book is on 3D vision, and from that work I studied multi-view geometry. All four books are actually about one theme: I realized it's about structure in the data. This is reflected especially in the first book on 3D vision; in the last chapter I realized precisely the importance symmetry plays in our perception. We perceive objects, we naturally recognize them; human vision solved recognition long ago. These days people always say vision is about recreating 3D. Absolutely not. People say: we take multiple images, we create a whole point cloud, a mesh, a signed distance function, a Gaussian splat; we recreate the scene, see, I can view it from multiple angles. Is this 3D understanding? Or you create some videos, like Sora: look, it looks good. Absolutely not. This is not the representation or understanding of a world model. Our understanding is far beyond getting a whole bunch of point clouds or Gaussian splats we can view from different angles.
Have you noticed that when we see something, we get excited because we understand the 3D, we understand the content; we have already parsed it in our brain. But the machine has no idea what the heck is in there; it's just a bunch of point clouds, a depth map. When we see the angle change, we see 3D; we automatically recognize this is a hand, this is a body, this is a cup, this is an apple. We fill in that information with our brain, and then we assume that once machines can reconstruct 3D, they understand it too. That is completely wrong. Many works say they're building a 3D model by creating something for people to look at; that completely misses the purpose. Look at our own vision: we have the hippocampus, we have grid codes, highly structured; we understand the relationships between view-centric, object-centric, and allocentric representations. Neuroscientists understand this very, very well, but not computer scientists, not computer vision scientists. Some do. For example, on spatial understanding, we actually ran a test about a year ago on all the top multimodal models, huge, highly trained, highly commercialized models like GPT and Gemini. The title of the work is "Eyes Wide Shut." It's a very simple test: given images, do those large multimodal models understand spatial reasoning? What is on the left of something? How many objects are there in the space? What is behind something? What's on top of something? Very simple spatial questions, requiring not even very deep spatial understanding. But all the models fail miserably, and the majority of them are actually even worse than random guessing. I think only Gemini and GPT are a little above random guessing, and still far below human understanding. So that's the status.
Meaning that 3D understanding is actually very, very difficult, yet humans do it effortlessly. I can easily say: please hand me the bottle to your left. Or if you want to find, say, a shopping center: go to the door, turn right, and once you get outside of the building, head south. Through this one simple sentence we already switch from view-centric to object-centric to allocentric. If we don't have this kind of model, this kind of highly structured 3D model, forget about embodied AI or world models; we cannot conduct even these very simple spatial references. We have this world model not to visualize; we build a 3D model to interact, to manipulate, to influence. We're not building a 3D model just so that we can change our view, look at the scene from this angle or that angle, turn 360 degrees to visualize. No, we don't do that; that's not the purpose. Unfortunately, the field gets distracted by that kind of visualization. It looks cool, but if you really work on robotics, on navigation, locomotion, or manipulation, the usage is actually pretty limited. I won't necessarily say those reconstructions are useless, but they are pretty limited.
>> We should introduce the coding rate formula. I did have a question about that: there's an epsilon in there, so there's a bit of a question of how we tune it and what it means. We should also bring in, as we've been discussing a little, this concept of an LDR, a linear discriminative representation. And more broadly, with these inductive priors there's always the question: when we do abstraction to model regularities in the universe, there's always a little bit left over, isn't there? So to what extent can we think of these things as natural?
>> You touch upon a very, very deep question. It actually took me almost 30 years to understand it, to be honest. We did mention it early on: when we try to differentiate different measures, different volumes, it turns out lossy coding is necessary. It's not just something hacky; it turns out to be necessary to do lossy coding. In fact, we recently started to realize that noise plays very different roles, and it's very confounding, very confusing to many people.
This is something my students and I have realized; we'll probably have some papers about it. I can elucidate a little. Think about the whole diffusion-denoising approach, which is very popular right now. Why do we add noise to the data, to the whole world? Because we don't know where the distribution is. There's a phrase everybody knows: all roads lead to Rome. Why is that? Has anyone given a thought to why all roads lead to Rome? Very simple: at some point in history, Rome built the roads to reach the whole world. That's a diffusion process. Then if you want to find Rome, you denoise: you follow the same roads back, and you get to where Rome is. That's the low-dimensional structure; that's where the knowledge is. So it's a very natural process: adding noise to the data is precisely building the roads, and denoising brings us back, remembering where we came from. We have to add noise to reach the whole earth.
There's another role for noise. Remember, we only have isolated samples. Even when we talk about manifolds, how many points do you have on the manifold, how many points do you observe? Always finitely many. So why do you call it a continuum? Why do you connect dots into lines, planes, surfaces? When do you do that? Hence noise plays another role within the manifold: even with finite samples, if you allow lossy coding, if you allow packing spheres, then things start to connect. Noise is very important for connecting the dots.
very important to help to connect the dots. Right? We all know the phenomena
dots. Right? We all know the phenomena of percolation, right? We see raindrops on the floor. You only see two phases, right? One phase is all the dots are
right? One phase is all the dots are isolated. another face is all things
isolated. another face is all things gets wet. You never see anything into in
gets wet. You never see anything into in the middle because there's a sharp face transition. Once the the the sphere once
transition. Once the the the sphere once the dots the density gets high enough they collects everything right. Maybe
that's a phase transition we reach. We
realizing a connected plane is a better solution to explain all the data more parsimmonious more economic
the cost to memorize all the dots versus memorize other plane start to switch maybe abstraction has something to do with that. I don't know but from a
with that. I don't know but from a compression point of view this can already allow us to explain when do we go from zero dimension samples to prefer
a low dimensional manifolds and also how go from that low dimensional manifold to reach the rest of the world right so you can see even in this process noise is already playing
So you can see that even in this process, noise, this epsilon, plays different roles, and at some point the spheres get connected around the surface. We're still trying to figure out exactly what happens there, but these two big phases we already understand. In the past several years, our understanding of this subject, how we compress, how we pursue low-dimensional structure from finite samples, has truly advanced dramatically. I'm very happy about that; honestly, this question baffled me as a graduate student. You can see that even my early work on lossy coding, lossy compression, reflected my bafflement. I feel really thrilled that I've recently started to understand these things in a more unified way, not only theoretically but also algorithmically.
>> Yeah, it's so fascinating that we can look out the window and ignore so much detail. We don't look at the leaves on the roads; we just find the structure. That's why, when I watched your presentation, I was very intrigued when you said that iterative denoising is a form of compression.
>> I wanted to mention your ICML 2024 paper, last year in Vienna, with Wang. You found that the loss surfaces arising from this technique are dramatically different: they're very smooth, with no harsh local minima and so on. What's the intuition for that?
>> In fact, our understanding of those phenomena goes back to the early days when we studied sparsity, when your data lies on very low-dimensional sparse surfaces, low-dimensional planes, or low-rank matrices. There we learned a very big lesson. The objective functions that evaluate sparsity or low-dimensionality are highly nonlinear and non-convex, and yet, in our traditional, orthodox understanding, non-convex optimization is always hard: in the general case it's NP-hard, and there are lots of spurious local minima you get stuck in, stagnant critical points, flat regions. Basically, the worst-case picture is a nightmare. But through the study of those low-dimensional, sparse structures, and that's what's featured in my previous book, on low-dimensional models for high-dimensional data analysis, we realized that when a non-convex problem, an optimization problem with a non-convex landscape, arises from nature, from very natural sources, the structures are actually highly regular, they have symmetry, and the landscape is actually extremely benign.
This is a complete 180-degree flip of the usual view of nonlinear optimization. In fact, even higher dimension helps: the higher the dimension, the better. We call it the blessing of dimensionality. Those regularities, those symmetries, tell us the landscapes of these objective functions are actually beautiful. First of all, they're highly regular: there are no stagnant points, no flat regions, not too many spurious local minima, and even the local minima already have very clear geometric or statistical meaning. Hence those landscapes are very amenable to very simple algorithms, such as gradient descent, finding the optimal solution. That almost indirectly explains why, even when we run gradient descent on deep networks, searching for low-dimensional distributions in very high-dimensional spaces, we somehow always end up somewhere nice. Fine, you may have to run a long time, but somehow you always end up well; those landscapes are not that hard to traverse.
It's precisely because those objective functions are highly regular. Now come back to the rate reduction objective. If you look at the objective function, it's not something arbitrary: it's counting the volume of the whole minus that of the parts. It's something extremely objective. It's not like a loss function people come up with by hand: I'll add this term, a weighted sum with various weights, some empirical penalty, somewhat ad hoc. All the terms describe physical volumes of the data. Hence you should expect these are quantities arising from nature, and from our earlier lessons we realized that indeed these objective functions have very benign landscapes. Even the local minima, not only the global minima, correspond to solutions that give you orthogonal subspaces; even the local ones that are not globally optimal have similar geometric structure. And there are no other weird critical points that would slow down the search for those minima. So that's actually quite interesting.
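A small sketch of what a benign non-convex landscape means in practice. This is not the rate reduction objective itself, just a classic stand-in with the same flavor: maximizing the Rayleigh quotient wᵀAw on the unit sphere is non-convex, yet every critical point is an eigenvector with clear meaning, and plain projected gradient ascent from random starts reliably reaches the global optimum (the matrix and step size here are illustrative choices):

```python
import numpy as np

# Maximize w^T A w over the unit sphere: non-convex, but every critical
# point is an eigenvector, the saddles are strict, and simple projected
# gradient ascent from random starts finds the global optimum.
rng = np.random.default_rng(0)
d = 20
B = rng.normal(size=(d, d))
A = B @ B.T                      # a random positive semidefinite matrix
top = np.linalg.eigh(A)[0][-1]   # ground-truth largest eigenvalue

for trial in range(5):
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for _ in range(10000):
        w = w + 0.01 * (A @ w)   # ascent step (a power-iteration-style update)
        w /= np.linalg.norm(w)   # project back onto the sphere
    # Every restart reaches the top of the landscape, not a bad local optimum.
    assert w @ A @ w > (1 - 1e-4) * top

print("all restarts reached the top eigenvalue:", top)
```

Every restart lands on the top eigenvalue: the non-convexity is harmless because all the critical points are meaningful (eigenvectors), which is the same qualitative picture described for the rate reduction landscape above.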
So you can see, this allows us to understand that maybe intelligence precisely exploits and harnesses those things. There is actually a big misunderstanding about intelligence from the last ten years. In machine learning theory we have a tendency to believe that intelligence, especially intelligence in nature, is designed to solve the hardest problems, the worst case. I beg to differ: intelligence is precisely the ability to identify what is easy to address first, what is easy to learn, what is natural to learn first. Then, only when that has been done and resources permit, does it get into more and more advanced tasks. Not everybody needs to learn advanced mathematics to survive; animals don't. Nature finds the easiest things to learn, with minimal energy and minimal effort, so that creatures survive best. Again, this is the principle of parsimony at play; there's another level of resource parsimony at play here. Once you realize this, you realize that understanding intelligence should really be about understanding what is most common: the low-dimensional structures, the easiest ones, the smooth ones, the benign distributions that you can get away with learning from fewer samples and that are very easy to formulate. In fact, that's how science progresses. A lot of physical models, Newton's laws, are very simple; we discover the simple ones first, then gradually reach general relativity and then quantum mechanics, and the equations get more complicated later. It's the same process: we identify what is most common, the easiest tasks, first. Hence much of machine learning theory, which tends to derive bounds for the worst cases, I think we should probably think twice about.
>> I love that characterization. It's similar to the least action principle in physics.
>> Exactly.
>> In a sense, we solve problems by taking many, many steps in different directions. I think we still leave a little bit of entropy open; we don't do pure hill climbing, but collectively we acquire these stepping stones, and the totality of that process is how we solve very complex problems. But I wanted to touch on a very interesting point you raised: we notice that when we have very large deep learning models, they tend to almost self-regularize, and they learn better, and there's this phenomenon of double descent and all of that. Tell me about that.
>> Fascinating question. This really brings me back to the early days when I tried to understand deep learning. When deep learning arrived, there were a lot of phenomena we tried to understand, and I was one of those trying: something good about dropout, something about thresholding, about different thresholdings, something about normalization. And then there's the fact that the models are very big, with a lot of parameters, and yet somehow deep networks do not have a tendency to overfit; somehow they still generalize okay. Then of course people realized there's something rather unlike the traditional, classical bias-variance trade-off: there tends to be double descent. I actually wrote a couple of papers about this, about normalization and related things, around 2019. Then I told my students we should stop trying to explain these isolated phenomena. We were like the blind men and the elephant, each seeing a little piece, each theory explaining a little bit. I thought there should be a total explanation; if we get the big picture, all of these become just consequences or implications of it. At that time we started to touch upon the concept that maybe the process deep networks carry out, layer-wise, is optimizing an objective that promotes parsimony, that promotes low-dimensionality.
dimensionality. Once we realized actually I was quite thrilled so then I told my student from now on we will no longer write any papers or about
overfitting. Why?
overfitting. Why?
Because if the network's operators are trying to compress, trying to realize a certain contracting map that compresses volume, then you will never overfit, even if you over-parameterize. A simple example: say my data lies on a straight line, a one-dimensional curve. I can embed this one-dimensional line in two dimensions, three dimensions, or a million dimensions. But if my operator, layer-wise, at each iteration, is always just shrinking my solution towards the line in all directions, I will never overfit, even if I over-parameterize and embed the line in billions of dimensions, with billions of parameters. Collectively, all those billions of parameters are shrinking my solution, denoising it, compressing it towards the line, like power iteration, just like PCA. Power iteration, regardless of the embedding dimension, computes the first singular direction; it always works, and it converges at the same speed. You never overfit. So, compression by nature: if the operators are performing compression or denoising, the process will no longer overfit anything. If you conduct it right and it converges, the solution will converge to the structure you desire.
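The power iteration point can be made concrete. Below is a minimal sketch (my own illustration, not code from his papers): data on a one-dimensional line is embedded in an arbitrarily high-dimensional space, and power iteration on the covariance recovers the line's direction. Accuracy does not degrade as the ambient dimension, the "parameter count" of the embedding, grows.

```python
import numpy as np

def power_iteration(A, iters=200):
    """Estimate the top eigenvector of a symmetric PSD matrix A."""
    v = np.random.default_rng(0).standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def recover_line(ambient_dim, n=200, noise=0.01, seed=1):
    """Sample noisy points on a 1-D line embedded in `ambient_dim` dimensions,
    then recover the line's direction by power iteration on the covariance.
    Returns |cos(angle)| between the true and the estimated direction."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(ambient_dim)
    u /= np.linalg.norm(u)                       # true direction of the line
    t = rng.standard_normal(n)                   # positions along the line
    X = np.outer(t, u) + noise * rng.standard_normal((n, ambient_dim))
    v = power_iteration(X.T @ X / n)             # contract toward the line
    return abs(u @ v)

# Over-embedding does not hurt: the estimate stays aligned with the line
# whether the ambient dimension is 3 or 1000.
for d in (3, 100, 1000):
    print(d, round(recover_line(d), 4))
```

The alignment stays essentially perfect at every ambient dimension, which is the sense in which a contracting operator cannot overfit no matter how over-parameterized the embedding is.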
>> That raises a natural question. We were interviewing Andrew Wilson from NYU, and he has several papers about implicit biases, where you have a combination of hard biases with symmetries and everything in between. If what you're saying is true, then why do we need inductive biases at all? Could we not pare back a little and just have really big models?
>> I see. This is exactly the thing. Early on, people didn't understand deep networks, and there was a lot of empirical trial and error. People tended to use the phrase "inductive bias" as a kind of magical sauce to explain the failures or successes of designing or training neural networks a certain way. To be honest, for a long time I never understood what an inductive bias is; maybe it's some regularization that accompanies learning some structure about the network or the data. But nowadays, in my recent work, I'd say that, at least as I understand it, all the inductive bias should be formulated as first principles. For example, we were able to deduce all the different network architectures, including the recent white-box CRATE, transformer-like, ReduNet-like, ResNet-like, or mixture-of-experts-like architectures, from the single inductive bias that the data distribution you are pursuing is low-dimensional. From that alone you already get the main architecture, the form of the operator for each layer, as a ResNet structure or a mixture-of-experts structure, and those per-layer operators are precisely conducting denoising, compression, or contraction. Are there additional assumptions you can make? Yes, you can.
For example, suppose my job is not just to compress the data as it is. In object recognition, say, I also want to enforce that my classification is translation-invariant, which is a symmetry. If my task should be invariant to a certain group action, I want to compress the transformed copies together. Voila, what do you get? If you still do compression, you naturally get convolution as the structure of the compression operator. So convolution is not something we impose; it results from first principles. The quote-unquote inductive bias is: assume you want to compress your data, and you want the compression to respect translation or rotation symmetry. Convolution is then the characteristic form of a compression operator that achieves that task. So we don't want to build in the inductive bias while we are searching for the solution. The inductive bias, in my understanding, should be the very assumptions we make at the very beginning. The rest should be deduction.
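The claim that convolution falls out of requiring an operator to respect translation symmetry can be sketched in a few lines. This is my own toy illustration using cyclic shifts in 1-D: averaging an arbitrary linear operator over all translations yields an operator that commutes with every shift, and such an operator is exactly a circular convolution (a circulant matrix).

```python
import numpy as np

def shift(n, s):
    """Cyclic shift matrix S_s: (S_s @ x)[i] = x[(i - s) % n]."""
    return np.roll(np.eye(n), s, axis=0)

def symmetrize(W):
    """Average a linear operator over all cyclic translations.
    The result commutes with every shift, i.e. it is translation-equivariant."""
    n = W.shape[0]
    return sum(shift(n, s) @ W @ shift(n, -s) for s in range(n)) / n

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))     # an arbitrary dense linear operator
C = symmetrize(W)

# C is circulant: every row is a cyclic shift of the row above it,
# so applying C to a signal is exactly a 1-D circular convolution.
assert np.allclose(C[1], np.roll(C[0], 1))
# And it commutes with translation, as equivariance requires.
S = shift(8, 3)
assert np.allclose(C @ S, S @ C)
print(C[0])                         # the convolution kernel (first row)
```

Nothing about convolution was put in by hand; the kernel structure is forced by the symmetry requirement, which is the deduction he describes.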
There should be no induction anymore; otherwise, we're doing trial and error. Basically, when we build a theory, we should already have done all the inductive observations, experiments, and assumptions. A good theory starts with very few inductive biases, assumptions, or axioms, and the rest should be deductive. I call that first principles.
>> We've been speaking about parsimony, which is about what to learn, and self-consistency, which is about how to learn, and we can sketch out a journey, I suppose, from control theory to learning. This methodology also has some interesting results, I think, around the continual learning problem, so let's sketch that out.
>> You can see, the compression, or even the rate reduction, tries to pursue the data distribution
and also transform it. That's the one-way direction, and there's almost no theoretic guarantee that your data is sufficient to identify the structure. You may start with very few samples, so there's no way the data is sufficient; say there are five types of apples and I've only seen four. But that process goes on: you compress what you have, and that is how you reach memory. And there's no guarantee; even during that process you may get stuck, or there may not be enough iterations, so the memory you get may not be accurate, may not be correct. Hence, how do we check? How do I further develop, evolve, and improve my memory, and make sure it can actually authentically predict, that it is a world model, that the model is accurate? You actually have to decode it. You can think of memory formation as an encoding process. Then, from my memory, I want to decode: I want to predict what's going to happen next second from what I observe right now, or at night I may want to dream about what could happen. The decoding is what allows us to check whether my memory is right, how accurately I can predict the next step. So this already forms a sort of autoencoding framework.
Now, of course, with autoencoding, if I have access to both the observation and my memory, just as when we train our big data models, I have control of both ends, and I can just force autoencoding end to end, the way people like to talk about it. But in a natural setting, for an animal or a human, we don't have control of both ends. We probably only have control of what's inside our own brains. We never really have access to measure whether my prediction of the 3D world is right. For example, the frame of this picture is rectangular. Do I ever measure it? You don't have to measure it, but somehow everybody believes the model is correct. How do we do that?
Hence, there's actually a self-correcting process. In fact, this idea probably goes back to Norbert Wiener: how can an animal correct its errors without measuring them? Cats can capture things very accurately; even if they make a single mistake, they can correct it very quickly. Somehow they are able to build a world model that is self-consistent with the world without physically measuring their errors. Hence the idea that you loop the prediction back into your brain and close the loop. That allows you to constantly predict and, based on your predictions and your observations, to check whether there is still a difference between them, all within your brain, and to use that error to correct. It turns out, and this is work with my students, that of course our observations lose information; they introduce noise, lose dimensions, lose information. But it turns out that as long as the distribution of the data in the world is low-dimensional enough, even though your encoding, your perception process, is lossy, this is still doable. Precisely when the distribution of the outside world has enough structure, when it is highly low-dimensional, your brain has enough degrees of freedom to discern any differences. This was quite an interesting revelation for us: low dimensionality is not just some technical assumption; it's actually necessary for this kind of closed-loop learning to be possible. And once you can close the loop, you constantly observe and constantly predict; you can constantly use your memory to predict and to correct it, which supports continuous learning, even lifelong learning. Rome was not built in a day, and our memory is never built in one day: we constantly improve it, constantly revise it. And this is the mechanism of intelligence.
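A toy version of this closed-loop idea (my own sketch, not the construction from the paper he mentions): a fixed lossy "sensor" E observes low-dimensional data, and a decoder is corrected using only errors measured after re-sensing its predictions, never in the raw data space. Because the data is low-dimensional and the decoder is given only matching degrees of freedom (as an idealization, I hand it the data's basis U to keep the sketch short), driving the internal error to zero also drives the never-measured external error to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
amb, lat, intr = 100, 32, 4          # ambient, sensor (latent), intrinsic dims

# "The world is low-dimensional": data on a 4-D subspace of R^100.
U = np.linalg.qr(rng.standard_normal((amb, intr)))[0]
X = rng.standard_normal((500, intr)) @ U.T

# Fixed lossy sensor E: 100 -> 32, so the raw data-space error is never observed.
E = rng.standard_normal((lat, amb)) / np.sqrt(lat)

# Decoder x_hat = U @ (A @ z): its range is restricted to the data's support, so
# its degrees of freedom match the intrinsic dimension (the idealization here).
A = np.zeros((intr, lat))

Z = X @ E.T                           # internal code of the observations
P = U.T @ E.T                         # how a decoded point is re-sensed
for _ in range(2000):
    Zhat = (Z @ A.T) @ P              # predict, then re-observe via the SAME sensor
    err = Zhat - Z                    # error compared entirely "inside the brain"
    A -= 0.3 * (P @ err.T @ Z) / len(X)   # gradient step on the internal error

# The data-space error also vanishes, although it was never measured directly.
external = np.linalg.norm(X - Z @ A.T @ U.T) / np.linalg.norm(X)
print(external)
```

The sensor here has 32 dimensions for 4-dimensional data, so there is slack to discern differences; if the intrinsic dimension exceeded the sensor's capacity, internal consistency would no longer pin down the external error, which matches his point that low dimensionality is necessary, not just convenient.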
Hence, this mechanism is itself already generalizable, so you don't need to add the adjective "general" in front of intelligence. There's no point in saying "general intelligence": if you implement the intelligence mechanism correctly, it's already generalizable. The knowledge learned by this mechanism at any point in time may not be generalizable, but the mechanism is. This is a very big confusion: we think that if we accumulate enough knowledge, it becomes generalizable. No, it's not, and it never will be. Any scientific theory, by definition of being scientific, is falsifiable, which means it's limited: it can only explain the world up to a certain point or a certain accuracy. There's always room for improvement. The scientific activity, our ability to revise our memory and acquire new memory, that is the generalizable ability. That is intelligence. Through natural selection in the early days, through feedback control and feedback correction, through the human history of trial and error accumulating empirical knowledge, through scientific discovery: all of these are doing this. That is what is common behind intelligence, not the memory accumulated up to a certain point. Even if we managed to memorize all the knowledge in the whole world, we would no longer be able to apply it when we find ourselves in a new environment, in a new situation, observing phenomena we have never seen before. That's the limitation of trying to gain general intelligence just by accumulating enough knowledge.
>> We should talk about your CRATE series of architectures. CRATE stands for Coding RAte reduction TransformEr, and you made some very interesting discoveries. For example, multi-head self-attention can be derived as a gradient step on the coding rate, and MLPs as sparsification operators. You were also talking about how something like a transformer can be described in a principled way. There's this interesting thing, isn't there: we didn't even design these architectures; we empirically tried lots of different things and happened upon the transformer. But something like that can actually come about from a first-principles approach.
>> If you look at the past decade or so of evolution, it's also a kind of natural selection process for the big models: from the early days of AlexNet and VGG, then ResNet, then transformers. By the way, these are just the few survivors; as I said, it's just like natural selection. People forget there was a time when a very popular area was AutoML, neural architecture search, where people tended to do random search for better architectures. Somehow only a few survived. There must be a reason: they must capture certain structures; they must have done something right. Now, from our understanding so far, ResNet actually captures the fact that each layer should be doing compression, doing optimization.
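The layer-as-optimization-step claim can be made concrete with the lossy coding rate from Ma and colleagues' papers, which measures how many nats n tokens need when coded up to precision eps; a gradient-descent step on it contracts tokens toward the low-dimensional structure they span. The CRATE papers show multi-head attention can be read as approximately such a step on projected tokens; the sketch below is my own minimal version, with no heads or projections, showing only that the step reduces the coding rate.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T)
    of a d x n token matrix Z: nats to code n tokens to precision eps."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def compression_step(Z, eps=0.5, step=0.1):
    """One gradient-descent step on R(Z); the gradient is
    alpha (I + alpha Z Z^T)^{-1} Z, which denoises/contracts the tokens."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    grad = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T) @ Z
    return Z - step * grad

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 64))        # 64 tokens in 16 dimensions
r0, r1 = coding_rate(Z), coding_rate(compression_step(Z))
print(r0, r1)                            # the step reduces the coding rate
```

Stacking such steps layer by layer is the "each layer is an optimization iteration" picture: the network unrolls gradient steps of a compression objective rather than being an arbitrary learned map.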
ResNet precisely reflects the iterative optimization architecture, and it precisely captures the fact that we're trying to cluster and compress what's similar, and discern, classify, or contrast what's dissimilar. You want to develop different experts; call them experts, call them clusters, call them groups, so be it. And the transformer again: self-attention precisely computes the correlation, the covariance, in the data, what's correlated with what, and uses that to further sparsify, to further classify things, to organize the distributions. These architectures must be doing something close to the right thing. So it's almost a matter of belief for us: if we believe there's something right there, then we should be able to derive CRATE from first principles and have a very clear, unified understanding. I think we've managed to do that, at least in part. The structures we've discovered so far provide a rather unified explanation of what these architectures have been doing. To be honest, maybe our earliest motive was just to explain and understand what we had done, but once we understood it, we realized we could go much further; even the current architectures have a lot of room for improvement. Not only can we dramatically simplify them; you can see that after CRATE, last year and this year, there has been a series of work from my group really showing that once you understand what is being done, and by what principle, you can dramatically simplify. You can even throw away the MLP layer if you only care about the compression and not about the final representation. Or take the attention head: since we know what it is optimizing, the rate reduction objective function, we can find an equivalent variational form of that objective, which is much easier to optimize. We end up with what we call ToST, where the self-attention step computes token statistics and is only linear in the number of tokens, no longer quadratic like current attention. Of course, if you look at the literature, other people have tried to identify linear-complexity architectures, such as Mamba or, I think, RWKV, but empirically, through trial and error. Now we derive this in a purely mathematical way, because we just find the equivalent variational form of the same objective function; the two have the same global optimum, but one is much easier to optimize. This is a trick we do all the time in optimization.
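The complexity point can be seen by counting what each operator has to form. Standard attention materializes an n x n score matrix, which is O(n^2 d); an operator that acts on every token through shared d x d token statistics costs O(n d^2), linear in the number of tokens n. The sketch below is my own simplification of that contrast, not ToST's actual operator.

```python
import numpy as np

def quadratic_attention(Q, K, V):
    """Standard softmax attention: the n x n score matrix makes this O(n^2 d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    A = np.exp(scores)
    return (A / A.sum(axis=1, keepdims=True)) @ V

def statistics_step(Z, step=0.1):
    """Update each token through the shared d x d second moment of all tokens:
    forming M costs O(n d^2), so the whole step is linear in sequence length n.
    (A simplification of the 'token statistics' idea, not ToST's exact operator.)"""
    n, d = Z.shape
    M = Z.T @ Z / n                     # d x d statistics, the only global object
    return Z - step * Z @ M             # contracts tokens along data directions

rng = np.random.default_rng(0)
Z = rng.standard_normal((64, 16))       # 64 tokens, 16 dimensions
att = quadratic_attention(Z, Z, Z)      # needs the full 64 x 64 score matrix
out = statistics_step(Z)                # needs only a 16 x 16 moment matrix
assert att.shape == out.shape == Z.shape
assert np.linalg.norm(out) < np.linalg.norm(Z)   # the step is a contraction here
```

Doubling the sequence length doubles the cost of the statistics step but quadruples the score matrix of standard attention, which is why a variational reformulation with the same optimum can be so much cheaper at scale.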
In the 200-plus years of developing better optimization algorithms, all those ideas can now help us design better descent operators, better optimization architectures, and improve the design of current architectures. Honestly, we have not really gone that far yet. There are many acceleration techniques, preconditioning, conjugate gradients, that exploit different landscapes. Once we understand the landscape, the type and cost of the objective function, better, there are gazillions of ideas we can use to further improve efficiency.
Honestly, we have barely started. That's actually what got some of my students excited to pursue this: realizing how little we have done from the optimization perspective, and how much room there might still be for improvement. You can see that even within the last couple of years we have produced two or three different generations of architectures. In the past that was almost unthinkable, because each new generation always came from a different group; it was like a random process, where whoever got lucky discovered something that worked, or tried hard enough to get something to work.
>> It's a tantalizing idea, though, that through this principled optimization there could be a convergent evolution towards the optimal architecture.
>> Then the search will no longer be random; it will actually be guided. Just as in your earlier suggestion, this becomes intelligent search, guided search. We understand the structure of the problems now, and hence we can do science; we are no longer just doing an empirical, inductive search process.
>> Why is OpenAI still using the transformer even though there are now superior architectures out there? And we should talk about this Token Statistics Transformer. As you just said, it has linear time complexity, which means in principle it should scale dramatically better than the kind of transformers we're using now. So why aren't we using it?
>> Well, there are attempts to scale this up. Of course, when you try to scale, other factors come in; scalability is related to the whole design. And indeed we tried: we scaled up with all the resources we have. Unlike a company, we are very limited in the resources needed to verify whether our architectures scale; we can only go up to a couple hundred GPUs or so with our academic resources, and hopefully that will improve. But one thing we did recently is to simplify the current practice in DINO. Meta has pre-trained the state-of-the-art visual representation model; everybody talks about visual world models, and that's sort of the best one. Meta put a lot of engineering effort into pre-training it, training on gazillions of images with self-supervised learning; it's a very remarkable engineering feat, and now everybody uses it. It turns out we found the system can be dramatically simplified once we realize what they're actually really trying to do. We have a work called SimDINO, simplified DINO, version one and version two. We simplify both versions dramatically: we get rid of dozens of hyperparameters, the architecture becomes ten times simpler, and the performance is better. We managed to scale up to a few hundred million parameters for an apples-to-apples comparison; it is dramatically easier to train, much more efficient, and everything is explainable. I think that has seriously drawn attention from the Meta team and also from the Google team, and I know there are now serious efforts to scale these new architectures up.
>> Yes. We interviewed the DINO folks at the time, and we've spoken to people like Ishan Misra. There's a potential tangent there about their kind of non-contrastive self-supervised learning, and also the whole unsupervised question of how useful those representations are for downstream tasks. Maybe we could go there, but I should say that I'm interviewing Kevin Murphy soon, and I know he reviewed your book very carefully. He asked me to give you this question: rate reduction is great, but it must be subject to a prediction or reconstruction loss in data space. How would you go beyond token prediction, which seems especially weird for images? That's what Kevin asked me to ask you.
>> This is actually a great question. In rate reduction, remember, the lossiness is actually encoded through the epsilon ball: we try to capture how the samples connect with one another. Right now, if we just minimize the coding of the representation through this lossy coding, the error is controlled by the epsilon ball but not enforced; we respect the epsilon ball through the lossy coding process. Now, to truly ensure correctness, remember that everything could go wrong. It also depends on the number of samples you have; maybe the images you chose are wrong because the data does not have enough density, so you are not able to interpolate. Hence the representation learned can be very funky.
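For reference, here is the objective as I understand it from the MCR^2 line of work; the epsilon below is the "epsilon ball" he refers to, the precision up to which features are coded:

```latex
% Lossy coding rate of n features Z = [z_1, ..., z_n] in R^{d x n},
% coded up to precision \epsilon (the "epsilon ball"):
R(Z; \epsilon) = \frac{1}{2} \log\det\!\Big( I + \frac{d}{n\epsilon^2}\, Z Z^\top \Big)

% Rate when the features are coded as k groups with memberships \Pi = \{\Pi_j\}:
R_c(Z; \epsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
    \log\det\!\Big( I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^2}\, Z \Pi_j Z^\top \Big)

% Rate reduction, maximized to expand between groups and compress within them:
\Delta R(Z; \Pi, \epsilon) = R(Z; \epsilon) - R_c(Z; \epsilon \mid \Pi)
```

Kevin Murphy's point is that nothing in this objective by itself ties the features Z back to the raw data, which is exactly the role of the decoding he turns to next.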
So now, to ensure that the representation distribution learned internally authentically reflects the original distribution up to a certain precision, you have to decode. There is constant encoding and decoding; our brain does this all the time, predictive coding and so forth. That encoding and decoding, verifying whether error remains in your prediction, in your reconstruction, matters a lot. Now the question, back to our earlier discussion, is: do we really need to measure that error in the data space, in the original token space? If we have that option, so be it; do that, and make the engineering simpler. But if we really want a system that, just like a human, self-learns by going out to observe, with two eyes or with some sensors, then we have to come up with a way to make sure our sensing process is accurate enough that we can do everything internally. We can predict, go back and observe, and compare what we predicted with what we observed through the same sensing channel; we compare locally, inside. In theory we actually prove that, at least in idealistic cases, this is possible: we can minimize the internal error, and once we correct it, the error in the original data space, the token space, will also diminish. That holds under technical conditions; under general conditions we still don't know. We actually have a paper proving that when your data distribution is a mixture of subspaces, this is rigorously possible, provided the dimensions of the subspaces are low enough compared to the capacity of the perception process. For general distributions we believe it is also true. This is actually how we will be able to learn the low-dimensional dynamics and structure in natural data, in motion, in the predicted world. So this is something we can decide in the future: end-to-end works if you have the option to do so; if you don't have that option, you have to figure out under what conditions you can do this autonomously and still reduce the error to almost zero.
>> We spoke about DINO, but another example would be ViT. We interviewed Lucas Beyer in Switzerland earlier this year; he invented ViT. If I understand correctly, CRATE is now very close to ViT, but it's so much more principled; it's explainable and so on. How close are we to knocking ViT off the leaderboard, if you like?
>> In fact, I think in many of the comparisons we're already very close. It's hard to compare apples to apples, but with similar parameter counts we're very much on par, and by the way, we never really put much engineering effort in; we just wanted to verify the concept. Indeed, one thing that came out of CRATE is that not only is the architecture design principled, but once we did the training, the internal structures learned are semantically, statistically, and geometrically very meaningful. Each head truly learns similar structures; each channel, each head truly becomes an expert for a certain type of visual pattern, for example legs of animals, ears of animals, faces of animals. We see that very clearly with CRATE, but we don't observe it in ViT. Of course, ViT may learn this too; this is actually the interesting thing. From the early days, people were sure that large models with redundancy definitely learn things internally, but it's very hard to say which part of the network learned the correct channels or the correct operators, because they are embedded in a more redundant structure. Early on, people called this the lottery ticket: it's somewhere in there. Then people try to distill it; that justifies that you should distill, that you should be able to compress. Even the LoRA-style post-processing people do is justified this way, and some find that after the post-processing, not only does the network become smaller, the performance gets better, and so on. Now we probably don't have to do that: the architecture does what it is designed to do, and we can actually explain what each component is doing, something statistically and geometrically very meaningful. The results also show that if there's enough data and your optimization, your training, is successful, those structures pop up naturally; the structures will do what they're designed to do.
>> And finally: many ML engineers and researchers watch the show. Given everything we've spoken about, how can they find out more about your work, and how can they get started building these kinds of architectures?
>> I think most
of our work is open-sourced on GitHub, including CRATE, the early ReduNet, which is conceptual but not very practical, and also ToST; all the code is available. By the way, these are academic implementations; we never had the resources to scale them up. Most are scaled up to GPT-2 or ImageNet scale; that's all we can afford. SimDINO is the one we scaled the most: we exhausted a lot of resources to go a little higher than that, but it's still no comparison to industrial scale at all. But I do believe Meta and Google are doing something about SimDINO, simplified DINO, and the code is there. As for the methodology, this is one reason we bit the bullet and wrote the book over the past two years. Although there's a series of papers, we believe that for people to get the big picture, a more systematic introduction helps. We put the book together and open-sourced it too; we will post links to all the data and all the code as well. We are also teaching the course, so students will actually practice most of the new architectures and methods, and all those codes will be made publicly available. I think that might be a good entrance if people want to learn the methodology and understand the theoretical chain of evidence, and also the empirical chain of evidence; the book attempts to do that. We have already started organizing this, and we're not done yet, but in chapter seven we are collecting the theory and seriously applying it to real-world data and tasks, such as image classification, image segmentation, pre-training, and even language, GPT-2-scale language models as well.
>> Professor Ma, it's been an absolute honor. Thank you so much for joining us today.
>> Yeah, thank you very much.