The Mathematical Foundations of Intelligence [Professor Yi Ma]
By Machine Learning Street Talk
Summary
Topics Covered
- Intelligence Formalizes as Parsimony and Self-Consistency
- LLMs Memorize Language, Don't Understand It
- Abstraction Marks Phase Transition Beyond Compression
- Current AI Replicates Empirical Memory Formation
- Rate Reduction Enables Principled World Models
Full Transcript
In the past 10 years, I think the question of intelligence, or artificial intelligence, has captured people's imagination. I'm one of them, but it took me about 10 years to really understand: can we actually make understanding intelligence a truly scientific or mathematical problem, and formalize it? You will probably get some of my opinions, and also the facts about it, and it will probably change your view of what intelligence is, which has also been a very uncertain process for me.
How do we clarify some of the common misunderstandings about intelligence? Through this journey, maybe we will gain an entirely new view of what we have really done in the past 10 years: what the practice of artificial intelligence, the mechanisms we have implemented behind all the deep networks and large models, truly are, and hence understand their limitations, and also what it takes to truly build a system that has intelligent behaviors or capabilities. I think we have reached the point where we will be able to address what is next for understanding even more advanced forms of intelligence: what is the difference between compression and abstraction, between memorization and understanding? I think for the future those are the big open problems for all of us to study.
MLST is supported by Cyber Fund. Link in the description.
>> The idea of having to traffic in squishy people in order to make our systems go is not immediately appealing.
Let's put it that way.
>> This episode is sponsored by Prolific.
>> Let's get a few quality examples in. Let's get the right humans in, to get the right quality of human feedback in. So we're trying to make human data, or human feedback: we treat it as an infrastructure problem. We try to make it accessible. We make it cheaper. We effectively democratize access to this data.
>> Professor Ma, it's amazing to have you on MLST. Welcome.
>> Thank you for having me.
>> So, normally I ask guests to introduce themselves, but given your stature in the field, I think it's best that I give you an introduction. Yi Ma is a world-leading expert in deep learning and artificial intelligence. He's the inaugural director of the School of Computing and Data Science and director of the Institute of Data Science at the University of Hong Kong. He's also a visiting professor at UC Berkeley, where he previously served as a full professor in electrical engineering and computer science. He's an IEEE Fellow, ACM Fellow, and SIAM Fellow whose pioneering work on sparse representation and low-rank structures has fundamentally shaped modern computer vision and machine learning. His recently published book, Learning Deep Representations of Data Distributions, proposes a mathematical theory of intelligence built on two principles: parsimony and self-consistency.
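The parsimony principle in this framework is often made concrete through the coding-rate measures from Ma and collaborators' rate-reduction (MCR²) line of work. As a rough illustration, here is a sketch of the standard coding-rate function (variable names are mine, not code from the book): features confined to a low-dimensional subspace cost fewer bits to encode than features spread over the whole space.

```python
# Sketch of the coding-rate function R(Z) = 1/2 logdet(I + d/(n*eps^2) Z Z^T)
# used in the rate-reduction (MCR^2) framework. Illustrative only.
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Approximate cost (in nats) of coding the n columns of the
    d x n feature matrix Z up to distortion eps."""
    d, n = Z.shape
    gram = (d / (n * eps**2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(np.eye(d) + gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)
# Features spread over all 8 dimensions are expensive to encode...
spread = rng.standard_normal((8, 100))
# ...while features lying on a 2-D subspace are much cheaper.
basis = rng.standard_normal((8, 2))
compact = basis @ rng.standard_normal((2, 100))

assert coding_rate(spread) > coding_rate(compact) > 0.0
```

Parsimony then corresponds to driving features toward structured (low coding-rate) configurations, while the rate-reduction objective contrasts the rate of the whole feature set against the rates of its parts.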
This framework has led to white-box transformers known as CRATE architectures, where every component can be derived from first principles rather than empirical guesswork. So, Professor Ma, tell me about your book.
>> You know, about seven or eight years ago, deep learning had pretty much changed the practice of machine learning and artificial intelligence over the past decade. Around then I had a chance to get back to Berkeley, which gave me a chance to look into these topics more deeply and try to understand them through a more principled approach. Hence the book is a kind of summary of the progress made over the past eight or more years by myself, my group, as well as many colleagues, trying to understand the principles behind deep networks and explain them from first principles. In that journey we also seem to have embarked on something a little beyond that: we found something probably more general behind it, which is intelligence, at least intelligence at a certain level. Hence, when I got back to join Hong Kong University about two years ago, I had a chance to design, or redesign, some of the curriculum to reflect the rapid progress in our field. So my students and colleagues decided that maybe it was time to systematically organize this body of knowledge into a textbook, as well as a new course, which I'm teaching this semester and which will likely be offered at Berkeley next semester as well. So this is actually probably the first time we have tried to provide a more principled approach to explaining deep networks, as well as some principles of intelligence.
>> And these principles are parsimony and self-consistency. So it's an ambitious idea, that these principles could explain natural and artificial intelligence. What do you mean by that?
>> Intelligence, artificial or natural, or whatever adjective you add to intelligence: we have to be very specific. It's a very loaded word, right? I mean, even intelligence itself may have different levels, different stages. So it's high time we clarified that concept scientifically or mathematically, so that we'll be able to talk about and study intelligence, and the mechanisms behind it, at each level. There are some more unified principles behind even the different stages of intelligence; there is something in common, and there are also things that differ, so it's high time we did that. One level of intelligence is that which is common to animals and humans; we humans are animals too. That level of intelligence is what we think is very common to all life: how memory works, how we learn knowledge about the external world, memorize it as part of our memory, and use it to predict, to react to the world, to help us make decisions, and to make better decisions for survival, and so on. That's very, very common, and this is the level of intelligence we're very much talking about in the book. Hence, for this level of intelligence: how does our memory work? Today we also have a fancy word for memory; we call it a world model. It's how we develop such a memory, such a world model, how the model evolves, and how we use it. That is the level we talk about.
So we actually believe that for this level of intelligence, for how our memory is formed and how it works, precisely these two principles are incredibly important. And we believe they're necessary. Memory, or knowledge, is precisely trying to discover what's predictable about the world. Hence all such information intrinsically has very low degrees of freedom; we call these low-dimensional structures. And the way to pursue such knowledge is precisely by trying to find the simplest representation of the data. Compression, denoising, dimension reduction: these are all just different words for pursuing such knowledge, such structure. That's what is captured by the word parsimony: making things as simple as possible, but not any simpler. This is the sentence Einstein used to describe science, and it is also precisely what intelligence, at least at this level, is doing. The second part of the sentence, "not any simpler," precisely speaks to consistency: making sure your memory is actually consistent, able to recreate and simulate the world just right, and not any simpler. If you're simpler, you may lose part of the predictivity, the ability to predict well. So those two actually coexist, we believe, and those two principles, parsimony and self-consistency, are actually the two characteristics of how our memory works.
>> So we want to have understanding which carves the world up at the joints, which represents the important invariances in the world, and the thesis, I think, is that compression might be necessary for understanding. My possible concern with that is that what we are doing with machine learning is representing extant examples of a long phylogenetic tree of evolution.
>> Mhm.
>> So to what extent does knowing their representation now help us? Do we also need to know how they evolved and where they might go in the future?
>> The process of acquiring knowledge, of gaining information about the outside world, is compression: finding what is compressible, what has order, what phenomena have order, what has low-dimensional structures that allow us to rule out variabilities and predict the world, predict tomorrow, or predict the world better in that sense. That ability, we believe, is really what intelligence is all about, at least the common intelligence we're talking about; we can talk about higher-level intelligence later. And if you look at the history of life, how life developed, we have actually come to believe the following. The laws that govern the physical world, we call physics. But what is the mechanism that governs the evolution of life? I think it's intelligence. Even in the process you mentioned, through evolution, life evolves, and precisely, organisms learn more and more knowledge about the world and encode it in DNA to pass on to the next generation. That is compression; that's a process of compressing what life has learned about the world into our DNA. But the mechanism to update it is very brute force: random mutation and natural selection. Yes, it does evolve, it does advance, but at a huge cost of resources and time, and it is also very unpredictable. If you're acute, you have probably observed that there is some similarity with how current big models evolve: many, many groups try, without principles, by trial and error, empirically, and the lucky ones survive, get adopted everywhere, become very popular, and dominate the practice. So in a sense you can make the analogy. When students ask me at which stage our artificial intelligence is today, there is already an analogy in nature: we are very much at the early stage of life forms. So that is a compression process, a process that also gains knowledge about the world. But of course, later on, individual animals developed brains, developed neural systems, developed senses, including vision and touch and so on. So we started to use a very different mechanism to learn, to compress our observations, to learn knowledge, and to build memories of the world, and even individuals started to have that ability, rather than just inheriting knowledge through their DNA. So those are different stages, and that part of the knowledge is no longer encoded in our genetics, in our genes, but rather in our brains. And that's actually the level of intelligence we talk about most of the time these days, which is common to animals and common to humans: the knowledge, or the intelligence, of brain functions.
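Ma's identification of compression, denoising, and dimension reduction as routes to low-dimensional structure can be illustrated with a toy example: data that looks 50-dimensional but actually lives near a 2-D subspace is almost fully captured by two principal components. (An illustrative sketch only; the names and numbers are mine, not from the book.)

```python
# Toy illustration of "compression = finding low-dimensional structure":
# data generated near a 2-D plane inside a 50-D ambient space can be
# represented almost losslessly by its top principal components.
import numpy as np

rng = np.random.default_rng(0)
n, ambient, intrinsic = 500, 50, 2

# The structured (predictable) part lives on a 2-D subspace...
basis = rng.standard_normal((ambient, intrinsic))
X = rng.standard_normal((n, intrinsic)) @ basis.T
# ...plus a little unpredictable noise.
X = X + 0.01 * rng.standard_normal((n, ambient))

# PCA via SVD: the top `intrinsic` singular values carry nearly all
# the variance, so 2 of 50 coordinates suffice to describe the data.
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
explained = (s[:intrinsic] ** 2).sum() / (s ** 2).sum()
print(f"variance captured by 2 of 50 dimensions: {explained:.4f}")
```

Denoising here is the same operation seen from the other side: discarding the 48 near-zero directions removes exactly the unpredictable part.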
>> Yeah, I mean, I think we would definitely agree with the statement that intelligence as a system produces artifacts. So Chollet's example is a road-building network: it produces roads, and the system has adaptivity because it can create new roads where there weren't any before. And then there's the question of, well, there are many ways to compress a thing. Some ways of compression represent the world at a deep, abstract level, and some don't. So we might argue that LLMs today, even though they do compress the data, only compress it in a superficially semantic way. And then there's this notion of, well, maybe we agree that intelligence is about the synthesis of new knowledge. So it's the acquisition of new knowledge, but we can only do that if the knowledge we already have represents the world at a deep, abstract level. So rather than it being random mutations, as in evolution, it's very, very structured, because the processes are physically instantiated, which means rather than just doing something completely random, it's guided by the process which created them.
>> There are a lot of confusions about what knowledge is and also what the process to gain knowledge is. For example, many people describe what the language models are doing to language. Which is, by the way, very different. Don't forget: our language is a result of compression. Language is precisely the code we learned to represent the knowledge we gained through our physical senses about the external world, through billions of years of evolution, through however our brain evolved. It's a result of that; language actually represents knowledge. We use natural language to encode our knowledge as something common to all people. Now we're using another model, another compression process, to memorize it. In a sense, you can argue that what those large language models are doing is treating that text as raw signals, and through further compression, identifying its statistical structures, its internal structures. What that is actually doing is not very clear; maybe it just helps us memorize the text as it is and regenerate it. It's not going through a process like the one through which our natural language developed, which was very long, and our language is actually grounded in our physical senses, in our world models, our memory. Our language precisely tries to describe that; it's an abstraction of the world model we have in our brain. It's the small fraction of the knowledge worth sharing with each other, far smaller than the actual model we have. We have many things in our senses and our memory that there is no way we can express in words. Don't forget: a small fraction of our brain processes natural language, but the majority processes visual and motor-sensory data. So that says something about the role of language. Now people are very much confusing the process: the natural language models we use to reprocess natural language, the knowledge as we know it, the human knowledge common to human society, treat this fraction of knowledge very much the way we would treat raw data. Hence you can see that even the mechanisms, transformers or whatever architectures, are very much the same ones we commonly use to reprocess, for example, videos and visual data: we are treating language as if it were visual data. We tend to confuse the mechanism by which we memorize or extract knowledge from our senses with what we should be doing with knowledge itself; we confuse that process with understanding, and presume the large language model is actually understanding the natural text. So there is a very, very fundamental difference. The same process of compressing raw data to extract statistical correlations, to find low-dimensional structures, works for the senses, to build knowledge. But we're actually applying it to our knowledge, to our natural languages, and we pretend that is understanding. That's probably not.
>> You cited Max Bennett in your talk.
By the way, folks at home, you should watch Professor Ma's talk on intelligence and some of the ideas in his book; I'll put a link in the video description. But you cited Max Bennett, and I read his book, A Brief History of Intelligence, and he also had this really interesting idea that language is basically a set of pointers: we're actually sharing simulations and second-order simulations, and in a sense those are just pointers to the simulations, and that's where a lot of the semantic content is. But you also spoke about levels of intelligence. It's a wonderful idea, isn't it? That we have this phylogenetic information accumulation, which is very slow, you know, every single physical generation; and then we have this ontogenetic accumulation in our lifetimes; and then we have social accumulation, so we have a big hard drive in the sky where we kind of store these pointers; and then we also do science, which is very abstract, where we just hypothesize about things. And what's very interesting about science is, you know, there's always been this division between empiricism and rationalism. So empiricism says if the idea is in the senses, it's in the mind; and this other idea is that sometimes we just conjure things which are not in the data. So that science thing is very interesting.
>> Exactly. So you just mentioned the four stages, and we elaborate on this a little in our book: how those four stages actually have something in common. What is common is precisely trying to extract structures from the data and record them, record what's predictable; through compression, through denoising, through dimensionality reduction, capture the correlations in the signals, and use that for predictivity, for predictions, and so on. That's in common. Hence for all four stages of intelligence, the principle of parsimony and the principle of self-consistency are at work, although through different mechanisms: different code books, different mechanisms for updating, for optimizing the information, or even for acquiring the information. That is actually very important to know. Now, the main thing you mention, and it's also a point of confusion, is precisely this: for animals, or humans, even human society at the stage before science appeared, you can see that almost all the knowledge we gained was through an empirical approach, kind of passive. We observe, we try, we make some errors and learn from our mistakes, and then we record that. For example, Chinese medicine and Indian medicine worked for many, many years, right? It's very similar, it even became similar to how DNA evolves: we search in a sort of less organized way, somewhat by chance, some by accident, but we accumulate knowledge that becomes very, very useful. We understood how the weather changes and how the planets move in a very empirical way, for many, many years. And we share that, we write it down in language, in text, and pass that knowledge down to the next generation, similar to what DNA does with the knowledge learned through a different process.
Now, there's a huge distinction. I place it around 3,000 years ago; we don't know what happened, to be honest. Maybe the whole process of acquiring empirical knowledge is just compression, but suddenly we became able, somehow, to do abstraction. We started to develop knowledge that is far beyond empirical observation. Think about it: for example, the notion of numbers. Natural numbers: we count up to 500, but suddenly we realize this process goes to infinity. Kids start to do that; in middle school most people start to get that point. And there's an amazing thing: long ago, Euclid formulated geometry, and one of the assumptions, I don't know if people know this, goes far beyond the empirical: it says two parallel lines never intersect. The word "never" actually implies infinity; it is something you would never observe empirically. How did we come up with that idea? What happened to our brains? We started to jump from compressing empirical knowledge and identifying correlations to something we actually formalize, something abstract. Hence in my book, towards the last chapter, I ask: is there a difference? Is abstraction just compression? Probably not. Abstraction is definitely related to compression, but there seems to be something different, something more. You know, Karl Popper, the famous philosopher of science, said that science is the art of over-simplification. The ability to abstract: we are able to hypothesize things, and there is a distinction between hallucinating and hypothesizing, I believe. But what is it? We know that through compression we can memorize, we can learn the data distribution, we can find a very good representation, we can even use that representation to regenerate data with the same distribution; we call that memorization. But is there a difference between memorizing a data distribution and understanding it? We can emulate how we conduct logical deduction, right? Just as now we can use supervised fine-tuning, or chain of thought, or reinforcement learning, to force the large language model to emulate, to memorize, how we solve logical problems, how we solve mathematical problems. But is that solution based upon understanding logic, understanding the necessity of logic, mastering the mechanism of logic and applying it, or is it just emulating the process? We don't know. So we have a lot of questions now. What is the difference between compression and abstraction? What is the difference between memorizing and understanding? It's kind of similar to when Turing was faced with the question: what is computable and what is not? We know there is a difference, but how do you crystallize that? Or, as we also now ask, is P equal to NP? We know there might be a distinction. Can we formalize that question? If we believe there's no difference, prove it; or if there is a difference, how can we qualitatively or quantitatively see what lies beyond compression, what will take us to the level of abstraction? Hence I believe this is the phase transition, from developing empirical knowledge to scientific knowledge. And what is the distinction? To me, sometimes I call this last stage of intelligence the true artificial intelligence. If people ever bother to read the proposal laid out by the folks in 1956, you'll find that that's actually the level of intelligence they truly meant to work on. But yet, from all we understand about the practice of the past decade, we are very much reproducing the kind of mechanism that operates at the level of memory, of how empirical memory forms. In fact, I believe even the large language models are precisely memorizing the large volume of text through the same mechanism by which we form empirical knowledge; the knowledge is encoded in natural language. Hence, whether that is equivalent to understanding: that's a big question mark.
>> It is tantalizing, isn't it? Thinking about when we come up with these new theories that don't seem to come from the data: where do they come from? You could be a Platonist and say they are just a gift from God, or a nativist and say somehow they're in our brain. Or maybe we could subscribe to the idea that there is a kind of deductive tree: the tree of all possible conceivable knowledge, which represents our cognitive horizon. And if only we could build systems that could acquire that tree very abstractly, and if we could design a compositional system that could creatively explore that tree, then somewhere in that tree we would be able to discover these abstract theorems. But then there's the question of, well, why don't large language models do this now? So we see the ARC challenge, for example, and what we see is that models are very, very bad at abstract compositional reasoning, where you need to take abstract things and combine them together to adapt to novelty. They don't do that very well. And in a sense, one kind of optimistic view is that these models are learning lots of factored representations, and it is conceivable that something like a large language model could do that. But the other school of thought is that it's just not possible, because they don't have abstract enough understanding. What do you think?
>> I believe, at least from all our current understanding of what the large language model, at least the current architecture, is doing: it is precisely using the same mechanism by which we extract empirical knowledge from data, the correlations, the low-dimensional structure within the data, to memorize natural language, which to some extent represents the knowledge we have. Hence the language model is precisely using the same process by which we acquire our empirical knowledge to process the large volume of natural language. From the mechanism side, I don't think it actually needs understanding. But in order to make that statement conclusive, you need to know what extra thing leads to your question: what do we mean by understanding, and what might really have that deductive structure? So this is a big question, to be honest. I think that's a question the scientific community truly needs to answer, or at least to discuss, now. And what do we mean by that? You know, since modern science there have always been two schools of process that allow us to propel science forward. One is inductive: we do experiments, we observe. The other is deductive.
Once we have accumulated enough inductive observations, experiments, empirical observations, we start to make hypotheses, assumptions, axioms, so that we can then go through a very rigorous deductive process to derive the implications of those assumptions and reach conclusions that are testable, that are actually measurable. Then, through verification by experiment, we can measure, and we can either verify or falsify the original assumptions. That's a very powerful process, but it relies precisely on having a very rigorous logical deductive system; without that, we cannot overthrow the original assumptions. So that is a fundamentally deep, abstract process. How do we have that in our brains? We can argue about whether or not animals have it, or at what point we reached the critical moment when we became able to develop that ability. Our brains evolved; maybe we went through some phase transition in our brain structure that allows us to identify those structures, Platonic or whatever you call them. And not only does the original scientist discover the logical, causal abstraction, but other people also gain the ability to understand it. That is actually quite amazing if you think about it. Even though human beings communicate through language, somehow we all, at a certain stage, develop the ability to understand, to learn mathematics, even mathematics discovered somewhere else; I can reach a similar level of understanding as the original discoverer, and be similarly convinced that the proof is rigorous. The logical deduction is something necessary, not like natural language, which has ambiguity in it. So this is something up for debate: is it really that there is a God out there, that there is some invariant knowledge, truly some ground truth out there? We don't know. But definitely, the ability that allows us to reach that level: I think that is truly the next thing we need to understand. What are the mechanisms, implementable, reproducible mechanisms, that would allow us to recreate, to have a system able to gain that kind of ability? I think that will be the next stage of intelligence. So we would be able to have an artificial system reach the stage of an educated, enlightened human, beyond humans as just animals.
>> And do you believe in practice that's possible?
>> I truly believe that it is a part of our brain, or a function, that can be discovered, understood, and even reproduced. But for now, what exactly that mechanism is, I don't think we have much of a clue. We know the artifacts, just like your example: we know the roads that the road-building company or network builds. Even today, we can learn logic, we can learn mathematics, but what is the mechanism that allows us to create a new mathematical theorem, create a new scientific theory, conduct logical deduction, and understand it? What is that mechanism? We don't understand; we only know it after the fact. The logic makes sense, but why it makes sense, what the mechanism in our brain is that we share, such that this whole deductive process makes sense to all of us, is still quite unclear.
>> The reason I asked the question is that a lot of cognitive science folks say that we understand because we're causally embedded, and what they mean by that is that we evolved with the world, which means all of the representations in our brain co-evolved with other things in the world, and it's deeply rooted. So the implication is that intelligence is quite specialized. You could say intelligence is just the efficient search of the space of Turing machine algorithms, and yeah, that would be correct, but it's describing the what, not the how. It's almost trivial to say, and it doesn't describe how we would implement intelligence. Whereas if intelligence is the acquisition of knowledge, it must be domain-specific, right? Because the road-building company that we spoke about can't build any type of road. They are restricted in terms of the materials they have access to, and presumably different types of knowledge are quite different from each other. So an intelligent process might be able to acquire knowledge over here but not over there. How specialized do you think intelligence is?
>> Remember, the road-building company is just an analogy, a special case. That exactly echoes what we mentioned earlier: intelligence has different stages, or even different forms, but there is always something in common. Even at the early stages, how DNA evolves, how our memory evolves, how human society's logic evolves, or how our scientific knowledge evolves, those are precisely different kinds of building: some are building buildings, some are building roads. But the mechanisms are common; they use similar principles, for example the concept of compression. The mechanism that is common behind all of this is to discover what is structured and what is not random. That is what is common: under the principle of parsimony, or compression, or dimension reduction, or denoising, to discover those structures. Although the operations might be different.
The domains, the knowledge, the distributions they are applied to might be different. Some are discrete, some are continuous. Some have higher intrinsic dimension, some lower; some are simpler. Some can be formulated as mathematical or physical equations; some cannot. But they still can be learned, can be compressed, can be memorized through other mediums: in DNA, in our neural networks, rather than in differential equations. So there are things that are very common: the mechanisms are common, even the principles are common, but the realization, the physical realization, could be different, the optimization mechanism could be different, and the codebook we learn could be different. This is something we need to understand. Once you understand that, it can help us understand a lot of things around us: what is common behind all intelligent behaviors, and also how each stage or each form may have its own domain-specific characteristics.
>> I know you're a big fan of cybernetics, which came from the 40s with Norbert Wiener, and it described this cybernetic action loop: an agent which senses and acts in a closed feedback loop. Based on what you just said, maybe the mechanism of intelligence is the same, but certainly the action space is different. So different embodied agents in one situation would be able to do A, B, and C, and over here something different. Could it be at least specialized in the action domain?
>> Those pioneers in the 1940s were interested in intelligence, but the level of intelligence they were interested in was mostly the animal level and the common human level. It's about how our brain works. It's not about DNA, although they made some analogies; they very much focused on how our brain acquires memory and builds a world model quickly through perception and action, interacting with the world to predict, make mistakes, and learn from that process. I think that's the stage they studied; they may not have quite gotten to the advanced stage we're talking about with science. They cared mostly about how memory works at the animal level, in order to build autonomous systems, autonomous machines that emulate our abilities at that level. Hence they called it, we call it, the cybernetics program. By the way, two things. Of course, the word cybernetics was a little bit abused later; just like artificial intelligence, it became a bad word for a certain period of time. And also, many people understand cybernetics in a very narrow-minded way; they think it's just about control. Actually it's not. If you read the book by Norbert Wiener, it characterizes at least the necessary characteristics a system at the animal level should have to be intelligent: how to record information, for which he needed information theory; how we correct errors, which is feedback control; how we improve our decision-making, for which he mentioned game theory, and so on and so forth. He even discussed the necessity of addressing nonlinearity, which in his view would explain why our brain has waves. He was definitely interested in what the necessary characteristics are for a system to be intelligent, and although he may have fallen short of how they should be put together, he definitely captured some of the essential characteristics an autonomous intelligent system should have, which, surprisingly, got forgotten in our past decade of practice building artificial intelligence systems. And don't forget, those are necessary characteristics that at least those pioneers were convinced an intelligent system should have. So this is something I think we should probably learn a lesson about from our history.
>> Can you sketch out the journey from information theory to your maximal coding rate reduction framework?
>> This is actually a very interesting question. Honestly, I'm not an information theorist; I was trained as a control theorist. But early on, when I was a graduate student, I also did communication. I took a lot of communication, random process, and information theory courses, although I didn't end up working in that area. So for many years this was something I never really practiced, until a few years ago, when I was studying what deep networks are doing and came to ask what intelligence is all about. I realized that maybe the common mathematical problem behind learning, at least learning knowledge at the level of memory, is pursuing a low-dimensional structure, or low-dimensional distribution, of high-dimensional data.
Okay. So once that became clear to me, it became a very soul-searching process. Now, if low dimensionality is the only prior, the only so-called, quote-unquote, inductive bias or assumption we can make, can we deduce everything from there? No. But if this is the only thing we can use, then what's special about it? The dimension is very low; hence the volume of the data should be very low. Then there actually come a lot of technical challenges. How do you differentiate? If I have a set of data and there are two models, both low-dimensional, that can equally support the data, equally explain the data, then which one do I choose? The challenge here is that if two models are both low-dimensional, then their volumes are all zero. I have an example in my book.
It troubled me a lot. I have eight dots on a line. In one case, the eight dots are evenly distributed. In another, four dots are clustered. And so on. The interpretation is very ambiguous if you think about it. What's wrong with treating them all as just eight dots, each occurring once, with probability 1/8? Nothing wrong with it. Or when should I say they all lie on a straight line? But a line already has one dimension, one degree higher than dots. Then also, what's wrong with saying those are eight dots, or a line, or a plane? As sets they are all degenerate, zero-volume.
So hence the question is: how do we measure the volume of the data? If you want to compress, you have to have a more generalized notion of volume to measure the space spanned by the data. That forced me to come across the concept of entropy. But entropy is also limited, because it precisely does not differentiate those kinds of interpretations. If you think about it, for zero-dimensional or one-dimensional distributions, the differential entropy is negative infinity: you are comparing one infinity with another infinity, or zero with another zero. So hence we came across lossy coding. Just as Shannon, after developing information theory, pointed out: when we actually do coding, we do rate-distortion coding, not lossless coding. That became a sort of magical source that gives us a much more general measure of volume for data in arbitrary spaces, and the measure can actually differentiate one degenerate low-dimensional model against another. That becomes a measure we can use to evaluate the data, to differentiate different models. Now, if we compress, those coding rates allow us to pursue those distributions in a very high-dimensional space. In fact, the very popular diffusion-denoising processes are doing precisely this: the denoising process is precisely reducing the entropy, so that we pursue a representation that is lower-dimensional, lower-entropy, and in the end converges to the distribution of the data. This is the first stage.
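This lossy coding rate can be written down concretely. A minimal sketch, assuming the form used in the maximal coding rate reduction line of work, R(Z, ε) = ½ log det(I + d/(mε²) ZZᵀ), applied to the "eight dots" ambiguity above (the specific point sets are my own illustrative choices):

```python
import numpy as np

def coding_rate(Z, eps=0.1):
    """Lossy coding rate R(Z, eps) = 1/2 * logdet(I + d/(m*eps^2) * Z Z^T)
    for a d x m matrix Z of m points in R^d: roughly, the bits needed to
    encode the points up to distortion eps."""
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (m * eps**2)) * (Z @ Z.T))[1]

# Two sets of "eight dots" in the plane (illustrative): one lies on a line,
# the other is spread out in 2-D. As discrete distributions both are just
# eight points with probability 1/8, so counting entropy cannot tell them
# apart; the lossy coding rate can.
x = np.linspace(-1.0, 1.0, 8)
line = np.vstack([x, np.zeros(8)])          # eight dots on a 1-D line
scatter = np.vstack([x, np.cos(7.0 * x)])   # eight dots genuinely 2-D

r_line = coding_rate(line)
r_scatter = coding_rate(scatter)
print(r_line, r_scatter)   # the 1-D set costs fewer bits at distortion eps
```

Unlike discrete entropy, which sees both sets as eight equally likely symbols, this volume measure stays finite on degenerate sets and assigns fewer bits to the set that hugs a one-dimensional line.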
Now, for memory, it's not just about finding where the distribution is; it's also about organizing it. A lot of people say, oh, learning is about compression, so why don't we go all the way to Kolmogorov complexity? No, you don't want to do that. First of all, Kolmogorov complexity is not computable. Second, we all know that if you really managed to compress your data down to its Kolmogorov complexity, the code would be random: the program that specifies the data would itself look random. We don't memorize a bunch of random numbers in our brain, right? Hence memory in our cortex is highly structured: different types of objects are very well organized in the IT cortex, and our spatial understanding is very well organized in the hippocampus. Highly structured, because the structuredness allows access: we want to access that knowledge repeatedly, we want to use it very efficiently, and we use it under very, very different conditions.
Hence I said our brains are very much doing Bayesian inference. Once we learn that distribution, we organize it: we transform the distribution into a very structured and organized form. The maximal rate reduction precisely reflects that necessity. You do not just reduce the coding rate to find where the distribution is; you also want to transform the data into a new representation such that the rate reduction is maximized. Hence the representation subsequently becomes structured and organized, which then facilitates efficient access: you can access memory under all types of conditions, which allows all types of conditional prediction, generation, and estimation. That's the essence. Reduce the coding rate, then maximize that reduction of coding rate: these reflect two related processes for building a good memory.
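The two processes can be sketched together. Assuming the standard maximal coding rate reduction objective, ΔR = R(Z, ε) − Σⱼ (mⱼ/m) R(Zⱼ, ε), the volume of the whole minus the weighted volumes of the parts, a structured representation with classes on orthogonal directions scores higher than an unorganized one (the toy data below is purely illustrative):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    # Lossy coding rate: 1/2 * logdet(I + d/(m*eps^2) * Z Z^T).
    d, m = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (m * eps**2)) * (Z @ Z.T))[1]

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R(whole) - sum_j (m_j / m) * R(part j): the volume of the
    whole minus the weighted volumes of the parts."""
    m = Z.shape[1]
    whole = coding_rate(Z, eps)
    parts = sum((np.sum(labels == j) / m) * coding_rate(Z[:, labels == j], eps)
                for j in np.unique(labels))
    return whole - parts

# Illustrative toy data: the same 1-D coordinates, represented two ways.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
labels = np.array([0] * 50 + [1] * 50)
# Organized: class 0 on the x-axis, class 1 on the y-axis (orthogonal parts).
orth = np.vstack([np.concatenate([a, np.zeros(50)]),
                  np.concatenate([np.zeros(50), a])])
# Unorganized: both classes collapsed onto the same axis.
mixed = np.vstack([np.concatenate([a, a]), np.zeros(100)])

print(rate_reduction(orth, labels), rate_reduction(mixed, labels))
```

ΔR is zero when the two classes are collapsed onto the same line and strictly positive when they occupy orthogonal subspaces, which is why maximizing it pushes the representation toward the organized, easily accessed form described above.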
>> First of all, the manifold hypothesis comes to mind, which is this idea that all natural data falls on some low-dimensional structure with a low intrinsic dimension. The other thing that springs to mind is that I'm a fan of geometric deep learning, which is this idea that we should imbue the system with inductive priors that represent symmetries and geometric structures in the world. And I think that principle is deeply embedded in this idea.
>> Exactly. If you look at my whole life, I have written four books. My early interest was studying computer vision. The first book is on 3D vision, and from that work I studied multi-view geometry. All four books are actually about one theme: I realized it's about structure in the data. This is reflected especially in the first book on 3D vision; in the last chapter I realized precisely the importance symmetry plays in our perception. We perceive objects, we naturally recognize them; human vision solved recognition long ago. These days people always say vision is about recreating 3D. Absolutely not. People say: we take multiple images, we create a whole point cloud, a mesh, a signed distance function, a Gaussian splat; we recreate the scene, see, I can view it from multiple angles. Is this 3D understanding? Or you create some videos, like Sora: look, it looks good. Absolutely not. This is not the representation or understanding of a world model. Our understanding is far beyond getting a whole bunch of point clouds or Gaussian splats we can view from different angles.
Have you noticed that when we see something, we get excited because we understand the 3D, we understand the content; we have already parsed it in our brain. But the machine has no idea what the heck is in there; it's just a bunch of point clouds, a depth map. When we see the angle change, we see 3D; we automatically recognize this is a hand, this is a body, this is a cup, this is an apple. We fill in that information with our brain, and then we assume that once machines can reconstruct 3D, they understand it too. That is completely wrong. Many works say they're building a 3D model by creating something for people to look at; that completely misses the purpose. Look at our own vision: we have the hippocampus, we have grid codes, highly structured; we understand the relationships between view-centric, object-centric, and allocentric representations. Neuroscientists understand this very, very well, but not computer scientists, not computer vision scientists. Some do. For example, on spatial understanding, we actually ran a test about a year ago on all the top multimodal models, huge, highly trained, highly commercialized models like GPT and Gemini. The title of the work is "Eyes Wide Shut." It's a very simple test: given images, do those large multimodal models understand spatial reasoning? What is on the left of something? How many objects are there in the space? What is behind something? What's on top of something? Very simple spatial questions, requiring not even very deep spatial understanding. But all the models fail miserably, and the majority of them are actually even worse than random guessing. I think only Gemini and GPT are a little above random guessing, and still far below human understanding. So that's the status.
Meaning that 3D understanding is actually very, very difficult, yet humans do it effortlessly. I can easily say: please hand me the bottle to your left. Or if you want to find, say, a shopping center: go to the door, turn right, and once you get outside of the building, head south. Through this one simple sentence we already switch from view-centric to object-centric to allocentric. If we don't have this kind of model, this kind of highly structured 3D model, forget about embodied AI or world models; we cannot conduct even these very simple spatial references. We have this world model not to visualize; we build a 3D model to interact, to manipulate, to influence. We're not building a 3D model just so that we can change our view, look at the scene from this angle or that angle, turn 360 degrees to visualize. No, we don't do that; that's not the purpose. Unfortunately, the field gets distracted by that kind of visualization. It looks cool, but if you really work on robotics, on navigation, locomotion, or manipulation, the usage is actually pretty limited. I won't necessarily say those reconstructions are useless, but they are pretty limited.
>> We should introduce the coding rate formula. I did have a question about that: there's an epsilon in there, so there's a bit of a question of how we tune it and what it means. We should also bring in, as we've been discussing a little, this concept of an LDR, a linear discriminative representation. And more broadly, with these inductive priors there's always the question: when we do abstraction to model regularities in the universe, there's always a little bit left over, isn't there? So to what extent can we think of these things as natural?
>> You touch upon a very, very deep question. It actually took me almost 30 years to understand it, to be honest. We did mention it early on: when we try to differentiate different measures, different volumes, it turns out lossy coding is necessary. It's not just something hacky; it turns out to be necessary to do lossy coding. In fact, we recently started to realize that noise plays very different roles, and it's very confounding, very confusing to many people.
This is something my students and I have realized; we'll probably have some papers about it. I can elucidate a little. Think about the whole diffusion-denoising approach, which is very popular right now. Why do we add noise to the data, to the whole world? Because we don't know where the distribution is. There's a phrase everybody knows: all roads lead to Rome. Why is that? Has anyone given a thought to why all roads lead to Rome? Very simple: at some point in history, Rome built the roads to reach the whole world. That's a diffusion process. Then if you want to find Rome, you denoise: you follow the same roads back, and you get to where Rome is. That's the low-dimensional structure; that's where the knowledge is. So it's a very natural process: adding noise to the data is precisely building the roads, and denoising brings us back, remembering where we came from. We have to add noise to reach the whole earth.
There's another role for noise. Remember, we only have isolated samples. Even when we talk about manifolds, how many points do you have on the manifold, how many points do you observe? Always finitely many. So why do you call it a continuum? Why do you connect dots into lines, planes, surfaces? When do you do that? Hence noise plays another role within the manifold: even with finite samples, if you allow lossy coding, if you allow packing spheres, then things start to connect. Noise is very important for connecting the dots.
very important to help to connect the dots. Right? We all know the phenomena
dots. Right? We all know the phenomena of percolation, right? We see raindrops on the floor. You only see two phases, right? One phase is all the dots are
right? One phase is all the dots are isolated. another face is all things
isolated. another face is all things gets wet. You never see anything into in
gets wet. You never see anything into in the middle because there's a sharp face transition. Once the the the sphere once
transition. Once the the the sphere once the dots the density gets high enough they collects everything right. Maybe
that's a phase transition we reach. We
realizing a connected plane is a better solution to explain all the data more parsimmonious more economic
the cost to memorize all the dots versus memorize other plane start to switch maybe abstraction has something to do with that. I don't know but from a
with that. I don't know but from a compression point of view this can already allow us to explain when do we go from zero dimension samples to prefer
a low dimensional manifolds and also how go from that low dimensional manifold to reach the rest of the world right so you can see even in this process noise is already playing
So you can see that even in this process, noise, this epsilon, plays different roles, and at some point the spheres get connected around the surface. We're still trying to figure out exactly what happens there, but these two big phases we already understand. In the past several years, our understanding of this subject, how we compress, how we pursue low-dimensional structure from finite samples, has truly advanced dramatically. I'm very happy about that; honestly, this question baffled me as a graduate student. You can see that even my early work on lossy coding, lossy compression, reflected my bafflement. I feel really thrilled that I've recently started to understand these things in a more unified way, not only theoretically but also algorithmically.
>> Yeah, it's so fascinating that we can look out the window and ignore so much detail. We don't look at the leaves on the roads; we just find the structure. That's why, when I watched your presentation, I was very intrigued when you said that iterative denoising is a form of compression.
>> I wanted to mention your ICML 2024 paper, last year in Vienna, with Wang. You found that the loss surfaces arising from this technique are dramatically different: they're very smooth, with no harsh local minima and so on. What's the intuition for that?
>> In fact, our understanding of those phenomena goes back to the early days when we studied sparsity, when your data lies on very low-dimensional sparse surfaces, low-dimensional planes, or low-rank matrices. There we learned a very big lesson. The objective functions that evaluate sparsity or low-dimensionality are highly nonlinear and non-convex, and yet, in our traditional, orthodox understanding, non-convex optimization is always hard: in the general case it's NP-hard, and there are lots of spurious local minima you get stuck in, stagnant critical points, flat regions. Basically, the worst-case picture is a nightmare. But through the study of those low-dimensional, sparse structures, and that's what's featured in my previous book, on low-dimensional models for high-dimensional data analysis, we realized that when a non-convex problem, an optimization problem with a non-convex landscape, arises from nature, from very natural sources, the structures are actually highly regular, they have symmetry, and the landscape is actually extremely benign.
This is a complete 180-degree flip of the usual view of nonlinear optimization. In fact, even higher dimension helps: the higher the dimension, the better. We call it the blessing of dimensionality. Those regularities, those symmetries, tell us the landscapes of these objective functions are actually beautiful. First of all, they're highly regular: there are no stagnant points, no flat regions, not too many spurious local minima, and even the local minima already have very clear geometric or statistical meaning. Hence those landscapes are very amenable to very simple algorithms, such as gradient descent, finding the optimal solution. That almost indirectly explains why, even when we run gradient descent on deep networks, searching for low-dimensional distributions in very high-dimensional spaces, we somehow always end up somewhere nice. Fine, you may have to run a long time, but somehow you always end up well; those landscapes are not that hard to traverse.
It's precisely because those objective functions are highly regular. Now come back to the rate reduction objective. If you look at the objective function, it's not something arbitrary: it's counting the volume of the whole minus that of the parts. It's something extremely objective. It's not like a loss function people come up with by hand: I'll add this term, a weighted sum with various weights, some empirical penalty, somewhat ad hoc. All the terms describe physical volumes of the data. Hence you should expect these are quantities arising from nature, and from our earlier lessons we realized that indeed these objective functions have very benign landscapes. Even the local minima, not only the global minima, correspond to solutions that give you orthogonal subspaces; even the local ones that are not globally optimal have similar geometric structure. And there are no other weird critical points that would slow down the search for those minima. So that's actually quite interesting.
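A small sketch of what a benign non-convex landscape means in practice. This is not the rate reduction objective itself, just a classic stand-in with the same flavor: maximizing the Rayleigh quotient wᵀAw on the unit sphere is non-convex, yet every critical point is an eigenvector with clear meaning, and plain projected gradient ascent from random starts reliably reaches the global optimum (the matrix and step size here are illustrative choices):

```python
import numpy as np

# Maximize w^T A w over the unit sphere: non-convex, but every critical
# point is an eigenvector, the saddles are strict, and simple projected
# gradient ascent from random starts finds the global optimum.
rng = np.random.default_rng(0)
d = 20
B = rng.normal(size=(d, d))
A = B @ B.T                      # a random positive semidefinite matrix
top = np.linalg.eigh(A)[0][-1]   # ground-truth largest eigenvalue

for trial in range(5):
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for _ in range(10000):
        w = w + 0.01 * (A @ w)   # ascent step (a power-iteration-style update)
        w /= np.linalg.norm(w)   # project back onto the sphere
    # Every restart reaches the top of the landscape, not a bad local optimum.
    assert w @ A @ w > (1 - 1e-4) * top

print("all restarts reached the top eigenvalue:", top)
```

Every restart lands on the top eigenvalue: the non-convexity is harmless because all the critical points are meaningful (eigenvectors), which is the same qualitative picture described for the rate reduction landscape above.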
So you can see, this allows us to understand that maybe intelligence precisely exploits and harnesses those things. There is actually a big misunderstanding about intelligence from the last ten years. In machine learning theory we have a tendency to believe that intelligence, especially intelligence in nature, is designed to solve the hardest problems, the worst case. I beg to differ: intelligence is precisely the ability to identify what is easy to address first, what is easy to learn, what is natural to learn first. Then, only when that has been done and resources permit, does it get into more and more advanced tasks. Not everybody needs to learn advanced mathematics to survive; animals don't. Nature finds the easiest things to learn, with minimal energy and minimal effort, so that creatures survive best. Again, this is the principle of parsimony at play; there's another level of resource parsimony at play here. Once you realize this, you realize that understanding intelligence should really be about understanding what is most common: the low-dimensional structures, the easiest ones, the smooth ones, the benign distributions that you can get away with learning from fewer samples and that are very easy to formulate. In fact, that's how science progresses. A lot of physical models, Newton's laws, are very simple; we discover the simple ones first, then gradually reach general relativity and then quantum mechanics, and the equations get more complicated later. It's the same process: we identify what is most common, the easiest tasks, first. Hence much of machine learning theory, which tends to derive bounds for the worst cases, I think we should probably think twice about.
>> I love that characterization. It's similar to the least action principle in physics.
>> Exactly.
>> In a sense, we solve problems by taking many, many steps in different directions. I think we still leave a little bit of entropy open; we don't do pure hill climbing, but collectively we acquire these stepping stones, and the totality of that process is how we solve very complex problems. But I wanted to touch on a very interesting point you raised: we notice that when we have very large deep learning models, they tend to almost self-regularize, and they learn better, and there's this phenomenon of double descent and all of that. Tell me about that.
>> Fascinating question. This really brings me back to the early days when I tried to understand deep learning. When deep learning arrived, there were a lot of phenomena we tried to understand, and I was one of those trying: something good about dropout, something about thresholding, about different thresholdings, something about normalization. And then there's the fact that the models are very big, with a lot of parameters, and yet somehow deep networks do not have a tendency to overfit; somehow they still generalize okay. Then of course people realized there's something rather unlike the traditional, classical bias-variance trade-off: there tends to be double descent. I actually wrote a couple of papers about this, about normalization and related things, around 2019. Then I told my students we should stop trying to explain these isolated phenomena. We were like the blind men and the elephant, each seeing a little piece, each theory explaining a little bit. I thought there should be a total explanation; if we get the big picture, all of these become just consequences or implications of it. At that time we started to touch upon the concept that maybe the process deep networks carry out, layer-wise, is optimizing an objective that promotes parsimony, that promotes low-dimensionality.
dimensionality. Once we realized actually I was quite thrilled so then I told my student from now on we will no longer write any papers or about
overfitting. Why?
overfitting. Why?
Because if the network's operators are trying to compress, trying to realize a certain contracting map that compresses volume, then you will never overfit, even if you over-parameterize. A simple example: say my data lies on a straight line, a one-dimensional curve. I can embed this one-dimensional line in two dimensions, three dimensions, or a million dimensions. But if my operator, layer-wise, at each iteration, is always just shrinking my solution towards the line in all directions, I will never overfit, even if I over-parameterize and embed the line in billions of dimensions, with billions of parameters. Collectively, all those billions of parameters are shrinking my solution, denoising it, compressing it towards the line, like power iteration, just like PCA. Power iteration, regardless of the embedding dimension, computes the first singular direction; it always works, and it converges at the same speed. You never overfit. So, compression by nature: if the operators are performing compression or denoising, the process will no longer overfit anything. If you conduct it right and it converges, the solution will converge to the structure you desire.
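The power iteration point can be made concrete. Below is a minimal sketch (my own illustration, not code from his papers): data on a one-dimensional line is embedded in an arbitrarily high-dimensional space, and power iteration on the covariance recovers the line's direction. Accuracy does not degrade as the ambient dimension, the "parameter count" of the embedding, grows.

```python
import numpy as np

def power_iteration(A, iters=200):
    """Estimate the top eigenvector of a symmetric PSD matrix A."""
    v = np.random.default_rng(0).standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def recover_line(ambient_dim, n=200, noise=0.01, seed=1):
    """Sample noisy points on a 1-D line embedded in `ambient_dim` dimensions,
    then recover the line's direction by power iteration on the covariance.
    Returns |cos(angle)| between the true and the estimated direction."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(ambient_dim)
    u /= np.linalg.norm(u)                       # true direction of the line
    t = rng.standard_normal(n)                   # positions along the line
    X = np.outer(t, u) + noise * rng.standard_normal((n, ambient_dim))
    v = power_iteration(X.T @ X / n)             # contract toward the line
    return abs(u @ v)

# Over-embedding does not hurt: the estimate stays aligned with the line
# whether the ambient dimension is 3 or 1000.
for d in (3, 100, 1000):
    print(d, round(recover_line(d), 4))
```

The alignment stays essentially perfect at every ambient dimension, which is the sense in which a contracting operator cannot overfit no matter how over-parameterized the embedding is.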
>> That raises a natural question. We were interviewing Andrew Wilson from NYU, and he has several papers about implicit biases, where you have a combination of hard biases with symmetries and everything in between. If what you're saying is true, then why do we need inductive biases at all? Could we not pare back a little and just have really big models?
>> I see. This is exactly the thing. Early on, people didn't understand deep networks, and there was a lot of empirical trial and error. People tended to use the phrase "inductive bias" as a kind of magical sauce to explain the failures or successes of designing or training neural networks a certain way. To be honest, for a long time I never understood what an inductive bias is; maybe it's some regularization that accompanies learning some structure about the network or the data. But nowadays, in my recent work, I'd say that, at least as I understand it, all the inductive bias should be formulated as first principles. For example, we were able to deduce all the different network architectures, including the recent white-box CRATE, transformer-like, ReduNet-like, ResNet-like, or mixture-of-experts-like architectures, from the single inductive bias that the data distribution you are pursuing is low-dimensional. From that alone you already get the main architecture, the form of the operator for each layer, as a ResNet structure or a mixture-of-experts structure, and those per-layer operators are precisely conducting denoising, compression, or contraction. Are there additional assumptions you can make? Yes, you can.
For example, suppose my job is not just to compress the data as it is. In object recognition, say, I also want to enforce that my classification is translation-invariant, which is a symmetry. If my task should be invariant to a certain group action, I want to compress the transformed copies together. Voila, what do you get? If you still do compression, you naturally get convolution as the structure of the compression operator. So convolution is not something we impose; it results from first principles. The quote-unquote inductive bias is: assume you want to compress your data, and you want the compression to respect translation or rotation symmetry. Convolution is then the characteristic form of a compression operator that achieves that task. So we don't want to build in the inductive bias while we are searching for the solution. The inductive bias, in my understanding, should be the very assumptions we make at the very beginning. The rest should be deduction.
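The claim that convolution falls out of requiring an operator to respect translation symmetry can be sketched in a few lines. This is my own toy illustration using cyclic shifts in 1-D: averaging an arbitrary linear operator over all translations yields an operator that commutes with every shift, and such an operator is exactly a circular convolution (a circulant matrix).

```python
import numpy as np

def shift(n, s):
    """Cyclic shift matrix S_s: (S_s @ x)[i] = x[(i - s) % n]."""
    return np.roll(np.eye(n), s, axis=0)

def symmetrize(W):
    """Average a linear operator over all cyclic translations.
    The result commutes with every shift, i.e. it is translation-equivariant."""
    n = W.shape[0]
    return sum(shift(n, s) @ W @ shift(n, -s) for s in range(n)) / n

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))     # an arbitrary dense linear operator
C = symmetrize(W)

# C is circulant: every row is a cyclic shift of the row above it,
# so applying C to a signal is exactly a 1-D circular convolution.
assert np.allclose(C[1], np.roll(C[0], 1))
# And it commutes with translation, as equivariance requires.
S = shift(8, 3)
assert np.allclose(C @ S, S @ C)
print(C[0])                         # the convolution kernel (first row)
```

Nothing about convolution was put in by hand; the kernel structure is forced by the symmetry requirement, which is the deduction he describes.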
There should be no induction anymore; otherwise, we're doing trial and error. Basically, when we build a theory, we should already have done all the inductive observations, experiments, and assumptions. A good theory starts with very few inductive biases, assumptions, or axioms, and the rest should be deductive. I call that first principles.
>> We've been speaking about parsimony, which is about what to learn, and self-consistency, which is about how to learn, and we can sketch out a journey, I suppose, from control theory to learning. This methodology also has some interesting results, I think, around the continual learning problem, so let's sketch that out.
>> You can see, the compression, or even the rate reduction, tries to pursue the data distribution
and also transform it. That's the one-way direction, and there's almost no theoretic guarantee that your data is sufficient to identify the structure. You may start with very few samples, so there's no way the data is sufficient; say there are five types of apples and I've only seen four. But that process goes on: you compress what you have, and that is how you reach memory. And there's no guarantee; even during that process you may get stuck, or there may not be enough iterations, so the memory you get may not be accurate, may not be correct. Hence, how do we check? How do I further develop, evolve, and improve my memory, and make sure it can actually authentically predict, that it is a world model, that the model is accurate? You actually have to decode it. You can think of memory formation as an encoding process. Then, from my memory, I want to decode: I want to predict what's going to happen next second from what I observe right now, or at night I may want to dream about what could happen. The decoding is what allows us to check whether my memory is right, how accurately I can predict the next step. So this already forms a sort of autoencoding framework.
Now, of course, with autoencoding, if I have access to both the observation and my memory, just as when we train our big data models, I have control of both ends, and I can just force autoencoding end to end, the way people like to talk about it. But in a natural setting, for an animal or a human, we don't have control of both ends. We probably only have control of what's inside our own brains. We never really have access to measure whether my prediction of the 3D world is right. For example, the frame of this picture is rectangular. Do I ever measure it? You don't have to measure it, but somehow everybody believes the model is correct. How do we do that?
Hence, there's actually a self-correcting process. In fact, this idea probably goes back to Norbert Wiener: how can an animal correct its errors without measuring them? Cats can capture things very accurately; even if they make a single mistake, they can correct it very quickly. Somehow they are able to build a world model that is self-consistent with the world without physically measuring their errors. Hence the idea that you loop the prediction back into your brain and close the loop. That allows you to constantly predict and, based on your predictions and your observations, to check whether there is still a difference between them, all within your brain, and to use that error to correct. It turns out, and this is work with my students, that of course our observations lose information; they introduce noise, lose dimensions, lose information. But it turns out that as long as the distribution of the data in the world is low-dimensional enough, even though your encoding, your perception process, is lossy, this is still doable. Precisely when the distribution of the outside world has enough structure, when it is highly low-dimensional, your brain has enough degrees of freedom to discern any differences. This was quite an interesting revelation for us: low dimensionality is not just some technical assumption; it's actually necessary for this kind of closed-loop learning to be possible. And once you can close the loop, you constantly observe and constantly predict; you can constantly use your memory to predict and to correct it, which supports continuous learning, even lifelong learning. Rome was not built in a day, and our memory is never built in one day: we constantly improve it, constantly revise it. And this is the mechanism of intelligence.
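A toy version of this closed-loop idea (my own sketch, not the construction from the paper he mentions): a fixed lossy "sensor" E observes low-dimensional data, and a decoder is corrected using only errors measured after re-sensing its predictions, never in the raw data space. Because the data is low-dimensional and the decoder is given only matching degrees of freedom (as an idealization, I hand it the data's basis U to keep the sketch short), driving the internal error to zero also drives the never-measured external error to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
amb, lat, intr = 100, 32, 4          # ambient, sensor (latent), intrinsic dims

# "The world is low-dimensional": data on a 4-D subspace of R^100.
U = np.linalg.qr(rng.standard_normal((amb, intr)))[0]
X = rng.standard_normal((500, intr)) @ U.T

# Fixed lossy sensor E: 100 -> 32, so the raw data-space error is never observed.
E = rng.standard_normal((lat, amb)) / np.sqrt(lat)

# Decoder x_hat = U @ (A @ z): its range is restricted to the data's support, so
# its degrees of freedom match the intrinsic dimension (the idealization here).
A = np.zeros((intr, lat))

Z = X @ E.T                           # internal code of the observations
P = U.T @ E.T                         # how a decoded point is re-sensed
for _ in range(2000):
    Zhat = (Z @ A.T) @ P              # predict, then re-observe via the SAME sensor
    err = Zhat - Z                    # error compared entirely "inside the brain"
    A -= 0.3 * (P @ err.T @ Z) / len(X)   # gradient step on the internal error

# The data-space error also vanishes, although it was never measured directly.
external = np.linalg.norm(X - Z @ A.T @ U.T) / np.linalg.norm(X)
print(external)
```

The sensor here has 32 dimensions for 4-dimensional data, so there is slack to discern differences; if the intrinsic dimension exceeded the sensor's capacity, internal consistency would no longer pin down the external error, which matches his point that low dimensionality is necessary, not just convenient.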
Hence, this mechanism is itself already generalizable, so you don't need to add the adjective "general" in front of intelligence. There's no point in saying "general intelligence": if you implement the intelligence mechanism correctly, it's already generalizable. The knowledge learned by this mechanism at any point in time may not be generalizable, but the mechanism is. This is a very big confusion: we think that if we accumulate enough knowledge, it becomes generalizable. No, it's not, and it never will be. Any scientific theory, by definition of being scientific, is falsifiable, which means it's limited: it can only explain the world up to a certain point or a certain accuracy. There's always room for improvement. The scientific activity, our ability to revise our memory and acquire new memory, that is the generalizable ability. That is intelligence. Through natural selection in the early days, through feedback control and feedback correction, through the human history of trial and error accumulating empirical knowledge, through scientific discovery: all of these are doing this. That is what is common behind intelligence, not the memory accumulated up to a certain point. Even if we managed to memorize all the knowledge in the whole world, we would no longer be able to apply it when we find ourselves in a new environment, in a new situation, observing phenomena we have never seen before. That's the limitation of trying to gain general intelligence just by accumulating enough knowledge.
>> We should talk about your CRATE series of architectures. CRATE stands for Coding RAte reduction TransformEr, and you made some very interesting discoveries. For example, multi-head self-attention can be derived as a gradient step on the coding rate, and MLPs as sparsification operators. You were also talking about how something like a transformer can be described in a principled way. There's this interesting thing, isn't there: we didn't even design these architectures; we empirically tried lots of different things and happened upon the transformer. But something like that can actually come about from a first-principles approach.
>> If you look at the past decade or so of evolution, it's also a kind of natural selection process for the big models: from the early days of AlexNet and VGG, then ResNet, then transformers. By the way, these are just the few survivors; as I said, it's just like natural selection. People forget there was a time when a very popular area was AutoML, neural architecture search, where people tended to do random search for better architectures. Somehow only a few survived. There must be a reason: they must capture certain structures; they must have done something right. Now, from our understanding so far, ResNet actually captures the fact that each layer should be doing compression, doing optimization.
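The layer-as-optimization-step claim can be made concrete with the lossy coding rate from Ma and colleagues' papers, which measures how many nats n tokens need when coded up to precision eps; a gradient-descent step on it contracts tokens toward the low-dimensional structure they span. The CRATE papers show multi-head attention can be read as approximately such a step on projected tokens; the sketch below is my own minimal version, with no heads or projections, showing only that the step reduces the coding rate.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T)
    of a d x n token matrix Z: nats to code n tokens to precision eps."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def compression_step(Z, eps=0.5, step=0.1):
    """One gradient-descent step on R(Z); the gradient is
    alpha (I + alpha Z Z^T)^{-1} Z, which denoises/contracts the tokens."""
    d, n = Z.shape
    alpha = d / (n * eps ** 2)
    grad = alpha * np.linalg.inv(np.eye(d) + alpha * Z @ Z.T) @ Z
    return Z - step * grad

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 64))        # 64 tokens in 16 dimensions
r0, r1 = coding_rate(Z), coding_rate(compression_step(Z))
print(r0, r1)                            # the step reduces the coding rate
```

Stacking such steps layer by layer is the "each layer is an optimization iteration" picture: the network unrolls gradient steps of a compression objective rather than being an arbitrary learned map.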
ResNet precisely reflects the iterative optimization architecture, and it precisely captures the fact that we're trying to cluster and compress what's similar, and discern, classify, or contrast what's dissimilar. You want to develop different experts; call them experts, call them clusters, call them groups, so be it. And the transformer again: self-attention precisely computes the correlation, the covariance, in the data, what's correlated with what, and uses that to further sparsify, to further classify things, to organize the distributions. These architectures must be doing something close to the right thing. So it's almost a matter of belief for us: if we believe there's something right there, then we should be able to derive CRATE from first principles and have a very clear, unified understanding. I think we've managed to do that, at least in part. The structures we've discovered so far provide a rather unified explanation of what these architectures have been doing. To be honest, maybe our earliest motive was just to explain and understand what we had done, but once we understood it, we realized we could go much further; even the current architectures have a lot of room for improvement. Not only can we dramatically simplify them; you can see that after CRATE, last year and this year, there has been a series of work from my group really showing that once you understand what is being done, and by what principle, you can dramatically simplify. You can even throw away the MLP layer if you only care about the compression and not about the final representation. Or take the attention head: since we know what it is optimizing, the rate reduction objective function, we can find an equivalent variational form of that objective, which is much easier to optimize. We end up with what we call ToST, where the self-attention step computes token statistics and is only linear in the number of tokens, no longer quadratic like current attention. Of course, if you look at the literature, other people have tried to identify linear-complexity architectures, such as Mamba or, I think, RWKV, but empirically, through trial and error. Now we derive this in a purely mathematical way, because we just find the equivalent variational form of the same objective function; the two have the same global optimum, but one is much easier to optimize. This is a trick we do all the time in optimization.
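The complexity point can be seen by counting what each operator has to form. Standard attention materializes an n x n score matrix, which is O(n^2 d); an operator that acts on every token through shared d x d token statistics costs O(n d^2), linear in the number of tokens n. The sketch below is my own simplification of that contrast, not ToST's actual operator.

```python
import numpy as np

def quadratic_attention(Q, K, V):
    """Standard softmax attention: the n x n score matrix makes this O(n^2 d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    A = np.exp(scores)
    return (A / A.sum(axis=1, keepdims=True)) @ V

def statistics_step(Z, step=0.1):
    """Update each token through the shared d x d second moment of all tokens:
    forming M costs O(n d^2), so the whole step is linear in sequence length n.
    (A simplification of the 'token statistics' idea, not ToST's exact operator.)"""
    n, d = Z.shape
    M = Z.T @ Z / n                     # d x d statistics, the only global object
    return Z - step * Z @ M             # contracts tokens along data directions

rng = np.random.default_rng(0)
Z = rng.standard_normal((64, 16))       # 64 tokens, 16 dimensions
att = quadratic_attention(Z, Z, Z)      # needs the full 64 x 64 score matrix
out = statistics_step(Z)                # needs only a 16 x 16 moment matrix
assert att.shape == out.shape == Z.shape
assert np.linalg.norm(out) < np.linalg.norm(Z)   # the step is a contraction here
```

Doubling the sequence length doubles the cost of the statistics step but quadruples the score matrix of standard attention, which is why a variational reformulation with the same optimum can be so much cheaper at scale.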
In the 200-plus years of developing better optimization algorithms, all those ideas can now help us design better descent operators, better optimization architectures, and improve the design of current architectures. Honestly, we have not really gone that far yet. There are many acceleration techniques, preconditioning, conjugate gradients, that exploit different landscapes. Once we understand the landscape, the type and cost of the objective function, better, there are gazillions of ideas we can use to further improve efficiency.
Honestly, we have barely started. That's actually what got some of my students excited to pursue this: realizing how little we have done from the optimization perspective, and how much room there might still be for improvement. You can see that even within the last couple of years we have produced two or three different generations of architectures. In the past that was almost unthinkable, because each new generation always came from a different group; it was like a random process, where whoever got lucky discovered something that worked, or tried hard enough to get something to work.
>> It's a tantalizing idea, though, that through this principled optimization there could be a convergent evolution towards the optimal architecture.
>> Then the search will no longer be random; it will actually be guided. Just as in your earlier suggestion, this becomes intelligent search, guided search. We understand the structure of the problems now, and hence we can do science; we are no longer just doing an empirical, inductive search process.
>> Why is OpenAI still using the transformer even though there are now superior architectures out there? And we should talk about this Token Statistics Transformer. As you just said, it has linear time complexity, which means in principle it should scale dramatically better than the kind of transformers we're using now. So why aren't we using it?
>> Well, there are attempts to scale this up. Of course, when you try to scale, other factors come in; scalability is related to the whole design. And indeed we tried: we scaled up with all the resources we have. Unlike a company, we are very limited in the resources needed to verify whether our architectures scale; we can only go up to a couple hundred GPUs or so with our academic resources, and hopefully that will improve. But one thing we did recently is to simplify the current practice in DINO. Meta has pre-trained the state-of-the-art visual representation model; everybody talks about visual world models, and that's sort of the best one. Meta put a lot of engineering effort into pre-training it, training on gazillions of images with self-supervised learning; it's a very remarkable engineering feat, and now everybody uses it. It turns out we found the system can be dramatically simplified once we realize what they're actually really trying to do. We have a work called SimDINO, simplified DINO, version one and version two. We simplify both versions dramatically: we get rid of dozens of hyperparameters, the architecture becomes ten times simpler, and the performance is better. We managed to scale up to a few hundred million parameters for an apples-to-apples comparison; it is dramatically easier to train, much more efficient, and everything is explainable. I think that has seriously drawn attention from the Meta team and also from the Google team, and I know there are now serious efforts to scale these new architectures up.
>> Yes. We interviewed the DINO folks at the time, and we've spoken to people like Ishan Misra. There's a potential tangent there about their kind of non-contrastive self-supervised learning, and also the whole unsupervised question of how useful those representations are for downstream tasks. Maybe we could go there, but I should say that I'm interviewing Kevin Murphy soon, and I know he reviewed your book very carefully. He asked me to give you this question: rate reduction is great, but it must be subject to a prediction or reconstruction loss in data space. How would you go beyond token prediction, which seems especially weird for images? That's what Kevin asked me to ask you.
>> This is actually a great question. In rate reduction, remember, the lossiness is actually encoded through the epsilon ball: we try to capture how the samples connect with one another. Right now, if we just minimize the coding of the representation through this lossy coding, the error is controlled by the epsilon ball but not enforced; we respect the epsilon ball through the lossy coding process. Now, to truly ensure correctness, remember that everything could go wrong. It also depends on the number of samples you have; maybe the images you chose are wrong because the data does not have enough density, so you are not able to interpolate. Hence the representation learned can be very funky.
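For reference, here is the objective as I understand it from the MCR^2 line of work; the epsilon below is the "epsilon ball" he refers to, the precision up to which features are coded:

```latex
% Lossy coding rate of n features Z = [z_1, ..., z_n] in R^{d x n},
% coded up to precision \epsilon (the "epsilon ball"):
R(Z; \epsilon) = \frac{1}{2} \log\det\!\Big( I + \frac{d}{n\epsilon^2}\, Z Z^\top \Big)

% Rate when the features are coded as k groups with memberships \Pi = \{\Pi_j\}:
R_c(Z; \epsilon \mid \Pi) = \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
    \log\det\!\Big( I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^2}\, Z \Pi_j Z^\top \Big)

% Rate reduction, maximized to expand between groups and compress within them:
\Delta R(Z; \Pi, \epsilon) = R(Z; \epsilon) - R_c(Z; \epsilon \mid \Pi)
```

Kevin Murphy's point is that nothing in this objective by itself ties the features Z back to the raw data, which is exactly the role of the decoding he turns to next.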
So now, to ensure that the representation distribution learned internally authentically reflects the original distribution up to a certain precision, you have to decode. There is constant encoding and decoding; our brain does this all the time, predictive coding and so forth. That encoding and decoding, verifying whether error remains in your prediction, in your reconstruction, matters a lot. Now the question, back to our earlier discussion, is: do we really need to measure that error in the data space, in the original token space? If we have that option, so be it; do that, and make the engineering simpler. But if we really want a system that, just like a human, self-learns by going out to observe, with two eyes or with some sensors, then we have to come up with a way to make sure our sensing process is accurate enough that we can do everything internally. We can predict, go back and observe, and compare what we predicted with what we observed through the same sensing channel; we compare locally, inside. In theory we actually prove that, at least in idealistic cases, this is possible: we can minimize the internal error, and once we correct it, the error in the original data space, the token space, will also diminish. That holds under technical conditions; under general conditions we still don't know. We actually have a paper proving that when your data distribution is a mixture of subspaces, this is rigorously possible, provided the dimensions of the subspaces are low enough compared to the capacity of the perception process. For general distributions we believe it is also true. This is actually how we will be able to learn the low-dimensional dynamics and structure in natural data, in motion, in the predicted world. So this is something we can decide in the future: end-to-end works if you have the option to do so; if you don't have that option, you have to figure out under what conditions you can do this autonomously and still reduce the error to almost zero.
>> We spoke about DINO, but another example would be ViT. We interviewed Lucas Beyer in Switzerland earlier this year; he invented ViT. If I understand correctly, CRATE is now very close to ViT, but it's so much more principled; it's explainable and so on. How close are we to knocking ViT off the leaderboard, if you like?
>> In fact, I think in many of the comparisons we're already very close. It's hard to compare apples to apples, but with similar parameter counts we're very much on par, and by the way, we never really put much engineering effort in; we just wanted to verify the concept. Indeed, one thing that came out of CRATE is that not only is the architecture design principled, but once we did the training, the internal structures learned are semantically, statistically, and geometrically very meaningful. Each head truly learns similar structures; each channel, each head truly becomes an expert for a certain type of visual pattern, for example legs of animals, ears of animals, faces of animals. We see that very clearly with CRATE, but we don't observe it in ViT. Of course, ViT may learn this too; this is actually the interesting thing. From the early days, people were sure that large models with redundancy definitely learn things internally, but it's very hard to say which part of the network learned the correct channels or the correct operators, because they are embedded in a more redundant structure. Early on, people called this the lottery ticket: it's somewhere in there. Then people try to distill it; that justifies that you should distill, that you should be able to compress. Even the LoRA-style post-processing people do is justified this way, and some find that after the post-processing, not only does the network become smaller, the performance gets better, and so on. Now we probably don't have to do that: the architecture does what it is designed to do, and we can actually explain what each component is doing, something statistically and geometrically very meaningful. The results also show that if there's enough data and your optimization, your training, is successful, those structures pop up naturally; the structures will do what they're designed to do.
>> And finally: many ML engineers and researchers watch the show. Given everything we've spoken about, how can they find out more about your work, and how can they get started building these kinds of architectures?
>> I think most
of our work is open-sourced on GitHub, including CRATE, the early ReduNet, which is conceptual but not very practical, and also ToST; all the code is available. By the way, these are academic implementations; we never had the resources to scale them up. Most are scaled up to GPT-2 or ImageNet scale; that's all we can afford. SimDINO is the one we scaled the most: we exhausted a lot of resources to go a little higher than that, but it's still no comparison to industrial scale at all. But I do believe Meta and Google are doing something about SimDINO, simplified DINO, and the code is there. As for the methodology, this is one reason we bit the bullet and wrote the book over the past two years. Although there's a series of papers, we believe that for people to get the big picture, a more systematic introduction helps. We put the book together and open-sourced it too; we will post links to all the data and all the code as well. We are also teaching the course, so students will actually practice most of the new architectures and methods, and all those codes will be made publicly available. I think that might be a good entrance if people want to learn the methodology and understand the theoretical chain of evidence, and also the empirical chain of evidence; the book attempts to do that. We have already started organizing this, and we're not done yet, but in chapter seven we are collecting the theory and seriously applying it to real-world data and tasks, such as image classification, image segmentation, pre-training, and even language, GPT-2-scale language models as well.
>> Professor Ma, it's been an absolute honor. Thank you so much for joining us today.
>> Yeah, thank you very much.