Two Clouds Over Agents: Real-time Interaction with the Environment, and Learning from Experience
By Bojie Li
Summary
Topics Covered
- Two Clouds Plague AI Agents
- Interactive ReAct Enables Thinking While Listening
- Fast and Slow Thinking Mimics Humans
- Three Paradigms Advance Agent Learning
- Externalized Learning Scales Beyond Transformers
Full Transcript
Hello everyone. I'm Bojie Li, the co-founder and chief scientist of Pine AI. Today I'm honored to give a talk about the two clouds over agents: real-time interaction with the environment, and learning from experience.
We know the famous remark attributed to Lord Kelvin: the knowledge of physics was almost complete, and only two small clouds remained on the horizon.
In the world of agents, we also have two clouds. The first cloud is real-time interaction, and the second cloud is learning from experience.
For the first cloud, real-time interaction: voice interaction today suffers from high latency, and the conversations often feel unnatural.
The agent sounds like a robot, not like a real human. The second cloud is learning from experience. For example, if you build an agent with Claude or Gemini, it starts each task from scratch; it has no ability to learn from previous tasks, regardless of whether those tasks succeeded or failed.
So we propose several paradigms for agents to learn from experience, and we also propose a new architecture for a streaming, event-driven agent loop that interleaves observation, thinking, and action. Let's go into the details.
Before we do, note that the two-clouds problem was not originally pointed out by me; it was pointed out by the OpenAI research scientist Shunyu Yao. He identifies two kinds of problems. The first is that evaluations are expected to run automatically.
That means the evaluation never engages a human or a real environment; it only performs tool calls in a simulated environment, with no human involved. The second problem is exactly our second cloud: agents have no mechanism to learn from experience. For example, he gives an example: if you have a test set of 500 tasks, you just execute the tasks one by one, independently.
But this does not resemble a human. A human software engineer solves issues faster and faster as they gain more understanding of the repository and of the coding guidelines inside the organization.
These are the hard problems that hinder the deployment of agents in the real world. So let's tackle them. Part one is the agent's real-time interaction with the environment. The environment means humans, the internet, or the physical world: the agent interacts with humans through real-time voice, with the internet through computer use (web browsers and GUIs), and with the physical world through robots; that last one is future work. The simplest setting in which agents interact with the environment is the voice agent: real-time interaction with real humans.
We all know ReAct, reasoning and acting: it is a loop of observation, reasoning, and action. But the latency of this loop is very high. Large language models need chain-of-thought, and for some reasoning models the thinking takes more than 10 seconds. And even if we only optimize the LLM part, it is not sufficient, because we also have the perception stage.
In this stage, we need to detect the user's voice activity; we need to know when the user finishes speaking. The reason is that most speech recognition models simply take a piece of audio as input and output text; they are not streaming. So you must wait for the speech-end event from the VAD (voice activity detection) model, then feed the audio into the ASR model, then feed the complete sentence into the large language model, which performs the thought process, then split the LLM's response into sentences, and feed them into a text-to-speech model to produce the voice response. The cumulative latency of all these stages far exceeds human tolerance; it is often as long as 10 seconds or even more.
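To see how the stages add up, here is a back-of-the-envelope sketch; the individual numbers are illustrative assumptions, not measurements from the talk:

```python
# Illustrative per-turn latency budget for the cascaded pipeline (assumed values).
pipeline_ms = {
    "vad_endpoint_wait": 500,   # wait to confirm the user really stopped speaking
    "asr_final_pass": 200,      # recognize the completed utterance
    "llm_thinking": 8000,       # chain-of-thought for a reasoning model
    "tts_first_audio": 300,     # time to first synthesized audio chunk
}
total = sum(pipeline_ms.values())
print(f"time to first response: {total / 1000:.1f} s")  # ~9 s, easily 10+ s in practice
```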
But even if we optimize the large language model part down to a latency of 1 second by disabling thinking entirely, there is still a problem. For example, if someone offers you a plan and says, "The plan is very good, would you like to accept it?", the AI will answer, "It's very good, let's accept it," agreeing to a possibly bad plan without any thinking. So how can we balance the dilemma of fast and slow thinking?
This is the hard problem we want to tackle. Our architecture is composed of three layers: the first is the perception layer, the second is the thinking layer, and the third is the execution layer.
In the first layer, we convert continuous real-world signals, for example voice and video, into a stream of discrete events. These events are easier for the thinking layer's large language model to process: they are just tokens. The thinking layer processes these events asynchronously.
Most importantly, this enables thinking while listening and speaking. For example, when you are thinking and new events arrive, you do not interrupt your thinking; you simply accept the new events and insert them into the thinking process. This produces an interleaved sequence of input events (events received from the real world, such as voice), internal thoughts, and actions. We will go into more detail in the next slides.
Third, the execution layer converts the discrete action commands back into real-world signals: for example voice through a text-to-speech model, or mouse movements through a VLA model that operates the computer.
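To make this three-layer flow concrete, here is a minimal sketch, not Pine AI's actual implementation, of a streaming event-driven loop in which perception, thinking, and execution run concurrently over event queues; all names are illustrative:

```python
import asyncio

async def perception(events: asyncio.Queue):
    # Stand-in for converting continuous audio into discrete events.
    for ev in ("speak_start", "transcript: hello", "speak_end"):
        await asyncio.sleep(0.3)        # placeholder for real audio timing
        await events.put(ev)
    await events.put(None)              # end of stream

async def thinking(events: asyncio.Queue, actions: asyncio.Queue):
    while (ev := await events.get()) is not None:
        # New events are folded into the ongoing thought instead of
        # restarting it; here we just emit a quick action per event.
        await actions.put(f"ack({ev})")
    await actions.put(None)

async def execution(actions: asyncio.Queue):
    while (act := await actions.get()) is not None:
        print("execute:", act)          # TTS / mouse control would go here

async def main():
    events, actions = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(perception(events), thinking(events, actions),
                         execution(actions))

asyncio.run(main())
```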
So first, the perception layer, which converts continuous real-world signals into discrete events.
Let us start with the limitations of VAD and ASR. VAD is voice activity detection, ASR is automatic speech recognition, and together they have a significant latency accumulation problem. First, there is VAD detection latency: to decide that the user has stopped speaking, we must wait 500 milliseconds or more to make sure the user is not going to continue. So after the user's last word, that is already 500 milliseconds; then the audio goes into the ASR model, which takes, say, another 100 to 200 milliseconds. The latency accumulates.
Another problem is information loss. The VAD model only outputs a binary signal, and the ASR model only outputs text without any acoustic details. For example, whether the speaker is in a bad mood or a good mood, ASR cannot tell. Whether there is background music or background noise, or whether the user makes a simple backchannel sound like "aha," it cannot capture that. There is information loss.
And we also have error propagation. VAD models are quite small models that run on the CPU in about 1 millisecond, so their accuracy is not very high. If you tune the threshold very high, they miss short utterances such as "hello" or "aha." If you tune the threshold very low, they include non-speech sounds: knocking on the table, for example, would be treated as speech.
This is not good. The last but also very important limitation is the lack of context in speech recognition models, because they recognize speech in segments: the VAD model has cut a complete paragraph into short utterances, and the ASR model has no context to understand these partial sentences, especially when recognizing addresses, brand names, personal names, and domain-specific terms. A simple example: suppose a person is called Bojie, which is very hard to pronounce in English. If the earlier context says "the user's name is Bojie" and "the email is b-o-j-i-e-l-i at gmail.com," everyone here can understand later mentions, because we know Bojie is the speaker's name and it is already written down. But the ASR model does not know that, so its error rate on this kind of recognition is very high. These are the problems of the existing approaches.
To resolve these problems, we need a new streaming speech perception model based on autoregressive large language models. Large language models are very good at handling contextual information, for example personal information, domain-specific terms, brand names, account numbers, and prices; an LLM understands these far better than a small standalone model such as Whisper or SenseVoice. And why do we call it a streaming approach? Because we do not want to cut the input audio into small segments and feed them to a standalone model. We build a multimodal model in which incoming speech tokens arrive in a streaming manner, and the text and other acoustic events are emitted in a streaming manner. It does not only output text tokens; it also outputs special tokens that represent acoustic events. For example, when the user starts or finishes speaking, it issues speak-start and speak-end events.
These events look similar to what a VAD model outputs, but they are much better: a VAD model is a tiny model that merely decides whether a sound looks like human speech, while these speak-start and speak-end events are produced by a large language model that understands the meaning of the user's utterances, so their accuracy is much higher. The model can also detect interruption intent: if we tell it that the AI is currently speaking, it produces an interrupt token when it determines the user wants to interrupt. But if the user does not intend to interrupt, for example just says "aha," it recognizes that there is no interruption intent.
In addition, there are emotions and other kinds of information, for example laughter, music, environmental sounds, and background noise. We capture all of this: not only the text but also the acoustic events.
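As an illustration, the output of such a streaming perception model might look like a single token stream with event tokens interleaved; the token names here are invented for the sketch, not the real model vocabulary:

```python
# Toy stream: ordinary text tokens interleaved with special acoustic-event tokens.
stream = [
    "<speak_start>", "I", "want", "to", "cancel",
    "<bgm>",                      # background music detected mid-utterance
    "my", "plan", "<speak_end>",
    "<laugh>",                    # non-lexical vocalization, kept as an event
]

for tok in stream:
    if tok.startswith("<") and tok.endswith(">"):
        print(f"acoustic event: {tok}")   # routed to the thinking layer as-is
    else:
        print(f"text token:     {tok}")
```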
Someone will ask: why not use an end-to-end model, speech in and speech out, audio tokens in and audio tokens out? Why do we need the text at all? Why not merge ASR, LLM, and TTS into a single unified model? We admit this may be the future; we think the future will be multimodal. But that is in theory. In practice we have experimented with many closed-source and open-source models, and the ASR + LLM + TTS pipeline performs better, especially in terms of intelligence.
So why is there an intelligence drop with multimodality? The problem may come from modality conflict. When we train the model, we do not have enough voice training data, so we first train a text-only model, then add the audio encoder and decoder, and use a small amount of training data to train a projection layer between them. This results in parameter divergence. For example, with an MoE (mixture-of-experts) model, some experts will specialize in handling speech. That is not what we want: if certain experts specialize in speech, speech takes a separate path from the main intelligence, a path that lacks real reasoning. So the reasoning ability on speech tasks decreases significantly.
To resolve this, we use the three-layer pipeline architecture of a perception layer, a thinking layer, and an execution layer, instead of a single end-to-end model.
Now let's go to layer two, the thinking layer. The goal is an interruptible, asynchronous model that accepts listening events, the acoustic events and text transcripts, while it is continuously thinking, and that can also speak while thinking. For example, it can give a short response within 1 second, think for another while, and then give a longer, more accurate response after that.
The input is a stream of input events, including acoustic events and transcripts; the outputs are GUI operations and speak events, including the text and the emotion to speak.
The core innovation is that we avoid the rigid observation-thinking-action loop of traditional ReAct. In traditional ReAct, the classic agent architecture, the loop is: observation (the transcripts), then thinking, and the thinking must complete before the action, where the action means saying something. And once it thinks, it has no choice not to act: if the human says "please wait," the model must still output something after thinking; it cannot choose to stay silent. This is contrary to the user's request.
Also, if we are thinking and another user utterance arrives, for example the user adds something, the existing thinking is dropped and we must think again from scratch to decide what to do next. So the loop is observe, think, act, then observe, think, act again. And the most critical problem is that the thinking latency is very high; this is the high response latency problem we discussed earlier.
In our interactive ReAct approach, we introduce flexibility into an interleaved loop of observation, thinking, and action. This gives us thinking while listening: new observations can be inserted into the ongoing thinking at any time. It also gives us speaking while thinking: we can respond immediately after a quick thought, then continue thinking and produce another response.
This approach maintains a unified, continuing stream of thought. Why does this work?
It rests on an observation about LLM processing speed. For prefill, we can achieve more than 500 input tokens per second, and state-of-the-art models, including Gemini 2.5 Flash and others, can achieve more than 100 output tokens per second. But voice input and output do not need that many tokens: normal human speech is about five tokens per second, so voice input and output together take only five to ten tokens per second. Looking at the whole timeline, more than 90% of the time is idle, redundant time available for thinking.
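A quick back-of-the-envelope check with these numbers (prefill is faster still, so it is not the bottleneck):

```python
# Speech occupies the channel at ~5 tok/s while the model decodes ~100 tok/s.
speech_tps, decode_tps = 5, 100
idle_fraction = 1 - speech_tps / decode_tps   # decoder time not needed for voice
print(f"compute left over for internal thinking: {idle_fraction:.0%}")  # 95%
```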
So the key is to exploit the gap time outside voice input and output, and spend all of it on internal thinking. This is the core point, because the traditional ReAct approach only uses the time between an observation event and the corresponding action; you must minimize the latency from observation to action, so you get very little thinking time, and while the other party, the user, is speaking, the LLM does nothing at all. What we want instead is for the large language model to work in parallel: while the user is speaking, and while the TTS is speaking, we think in the background.
This is what fast and slow thinking means in the thinking layer. For example, when we train the large language model to think this way, we want a fast response whenever an external event from the other party is observed. If the user says something, we expect a response within 500 milliseconds, so we reserve only about 50 thinking tokens for it. After that comes a more in-depth analysis: we give it, say, 500 tokens of slower thinking, and it outputs a follow-up.
For some very hard problems, even 500 thinking tokens are not enough, so we think twice, three times, or more; harder tasks get more thinking tokens. When multiple rounds of thinking are needed, the model outputs a summary of the current thinking after each round. This is analogous to human beings: a human says something while still thinking rather than staying completely silent.
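A minimal sketch of this fast/slow schedule, assuming a hypothetical budgeted-generation call `llm_think`:

```python
def llm_think(prompt: str, max_tokens: int) -> tuple[str, bool]:
    """Returns (thought_or_reply, done). Placeholder for a real budgeted LLM call."""
    return f"[{max_tokens}-token thought on: {prompt}]", max_tokens >= 500

def respond(user_event: str, max_rounds: int = 3) -> None:
    quick, _ = llm_think(user_event, max_tokens=50)   # ~500 ms budget
    print("quick reply:", quick)                      # filler / confirmation
    for _ in range(max_rounds):                       # slower, deeper passes
        deeper, done = llm_think(user_event, max_tokens=500)
        print("summary so far:", deeper)              # say something while thinking
        if done:                                      # a real model would decide this
            break

respond("do you want to confirm this order?")
```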
So this is how fast thinking, slow thinking, and continuous thinking work. Let's see a real-world example. The user says: "I want to change my plan from the current $19 one to a new one." The model's internal thinking begins: the user wants to change the plan, so I need the user's current plan details and the new plan's price.
But before that thought completes, the user may interrupt and add something, for example: "By the way, the new plan is the $79-per-month one." The model then continues the previous thought. This is very important: the internal thinking is not dropped and restarted. It is simply marked as interrupted, the new observation is inserted, and the interrupted thinking continues. After the thinking, the model outputs the assistant action, responding to the user based on the final thought. So interruptions inside the conversation are handled gracefully.
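Written out as data, the interleaved sequence the thinking layer sees for this example might look like the following; the roles and field names are illustrative:

```python
trace = [
    ("observe", "user: change my plan from the current $19 one"),
    ("think",   "need current plan details and new plan prices... <interrupted>"),
    ("observe", "user: by the way, the new plan is the $79/month one"),
    ("think",   "<resume> new plan is $79/month, compare with the $19 plan..."),
    ("act",     "assistant: got it -- the $79/month plan, let me check your account"),
]
for role, content in trace:
    print(f"{role:>7} | {content}")
```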
There is also the other half: speak while thinking. The part I just described is think while listening; speak while thinking means we first give a quick confirmation or quick response, then think, then give a longer response. In the traditional approach, for example, the user wants to confirm an order, there is a long silence of up to 10 seconds, and only then does the system ask to confirm or not. This is very unnatural.
In the interactive ReAct approach, when the user asks to confirm, the model spends at most 50 tokens on a quick thought and says a few filler words first. But these filler words are not the fixed, engineered kind, a canned "let me think" or "uh-huh." They are generated by the large language model and are contextualized: the model understands the current context and says something useful rather than merely buying time. In this case it says: "Let me just confirm, that is the $79-a-month plan, right?" That is the preliminary response. After the in-depth thought of at most 500 tokens, it gives the final response.
So this is speak while thinking. A more detailed sequence involves both aspects, thinking while listening and thinking while speaking; let's not go into those details here, we can talk about them offline. And here is an interruption-handling example: interruption and resumption in a real-time conversation.
We can also see that as a conversation grows longer, the internal thoughts become a heavy burden. As we said, voice input and output alone is only about five tokens per second, but with all the internal thoughts included it is around 100 tokens per second, roughly 20 times more tokens. The extra tokens are good for the consistency and fluency of thinking, but bad for cost, because such a long context is expensive. So we keep only the most recent turns of thinking, and for older turns we simply omit the internal thinking, using a windowing approach that keeps the context KV-cache friendly.
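A minimal sketch of such a windowing policy, assuming a simple turn structure:

```python
def window_context(turns: list[dict], keep_thinking_for_last: int = 3) -> list[dict]:
    """Keep recent turns verbatim; drop internal thinking from older turns.
    Pruning only at turn boundaries keeps prefixes stable for KV-cache reuse."""
    cutoff = len(turns) - keep_thinking_for_last
    pruned = []
    for i, turn in enumerate(turns):
        if i < cutoff:
            turn = {k: v for k, v in turn.items() if k != "thinking"}
        pruned.append(turn)
    return pruned

turns = [{"event": f"e{i}", "thinking": f"t{i}", "action": f"a{i}"} for i in range(6)]
for t in window_context(turns):
    print(t)
```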
This adaptive thinking model requires training. An off-the-shelf model is not good at understanding interruptions, or at continuing to think while listening when non-interrupting events arrive; it is not good at producing a quick thought followed by a longer thought; and it is not good at limiting its internal reasoning to 50 tokens to keep the initial response below 500 milliseconds.
To achieve that, we extract training data, for example using Claude 4 Sonnet to generate it, including alternating thinking and assistant texts. We emit the thinking as a content part; note that most major models, including OpenAI's and Gemini, expose only thought summaries rather than the original thinking tokens, while Claude 4 Sonnet does expose the original thinking tokens. So we generate test cases for the thinking and assistant texts: since we are using a SOTA model, we can use prompts and specialized test cases to elicit exactly the behavior shown in the previous examples. In this way we generate a large amount of supervised fine-tuning data, and then we use reinforcement learning to train a new model, for example based on a low-latency open-source model.
So, for adaptive thinking, thinking while listening, and thinking while speaking, the data engineering must solve length control, as we said on the last slide, and must cover several different scenarios, including simple greetings, complex decisions, and multi-turn dialogue.
One thing to highlight: the thought length must be part of the reinforcement learning reward function. The reason is that if we do not control the thought length, the model tends to use up the entire thinking budget, which is not good. For a simple greeting like "hello," do you really need 50 tokens of thinking? I don't think so. It should think very briefly, or not at all, because the context is really simple.
Likewise, after the quick response, if the model judges the question hard: do we really need 500 additional tokens, spending the whole budget and delivering the final response after 5 seconds? I don't think so; most questions are not that hard. So we use the thought length as part of the reward function and penalize long thoughts on simple questions. In this way we achieve truly adaptive thinking: simple questions are optimized for low actual response latency, while the long-thinking capability is reserved for hard questions and mission-critical questions. For example: do you want to accept this plan? Do you want to lower your bill? Do you want to accept this offer? There, filler words and filler sentences buy more thinking time to evaluate the plan and finally make a decision.
Next, layer three: the execution layer. It takes as input the outputs of the previous layer, the thinking layer with its main large language model. The main LLM's actions are simply things like speak or click: speak means saying something, and click means computer use, clicking in a virtual browser or virtual computer through a graphical user interface. The output is continuous signals, including the voice waveform and the mouse trajectory.
Let's first focus on the mouse problem. There are many computer-use agents online; almost every SOTA vendor has shipped one, but the common problem is that they are very slow. Why? Because each step takes a large screenshot as input, maybe 1,000 or 2,000 tokens; the model needs prefill time to process the screenshot and internal thinking to generate a click position, and then it clicks. Each step takes four or five seconds or more. It's very slow.
If we use a smaller model, say a 7-billion-parameter one, it can understand the screenshot well: given a 1,000-token screenshot, it knows where the search button is and that it needs to be clicked. But the problem with small models, especially ones not specifically trained for computer use, is that the model knows what to do but cannot do it. In its internal reasoning it states very clearly: "I need to click this search button." Yet when you look at the output coordinates of the click operation, the click(x, y) that is supposed to locate the element, the (x, y) often does not land on the search button; it misses the target.
Why the failure? Because there is a gap between what the model says and what it does: these vision-language models are trained largely on text-only data, and they lack a grounded understanding of screenshots, of the (x, y) coordinate system inside the screenshot.
This is not good. Many open-source frameworks tackle the problem with a labeling approach, using labeled bounding boxes as grounding hints for the vision-language model. How does it work? It takes a screenshot, and from the browser API we know which elements are clickable. We draw bounding boxes around the clickable elements and mark them 1, 2, 3, 4; in the prompt we provide the labels and bounding boxes, and the model outputs a label in order to click. So the cognitive difficulty of producing exact coordinates is reduced to outputting the label of a bounding box. It is a very clever technique, but it is very hard to generalize to general applications.
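Before looking at where it breaks down, here is a minimal sketch of the labeling idea (often called Set-of-Marks prompting); the element list is hypothetical:

```python
clickables = [
    {"label": 1, "box": (40, 10, 120, 40),  "text": "Inbox"},
    {"label": 2, "box": (40, 50, 120, 80),  "text": "Sent"},
    {"label": 3, "box": (200, 10, 280, 40), "text": "Search"},
]

prompt = "Click the search button. Elements:\n" + "\n".join(
    f'[{e["label"]}] "{e["text"]}" at {e["box"]}' for e in clickables
)
print(prompt)
model_answer = 3                              # model outputs a label, not (x, y)

box = next(e["box"] for e in clickables if e["label"] == model_answer)
x = (box[0] + box[2]) // 2                    # click the center of the box
y = (box[1] + box[3]) // 2
print(f"click at ({x}, {y})")
```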
For example, there are many complex interfaces. Take Gmail: one screenshot can show 50 emails plus many labels, tabs, and action buttons, easily more than 200 actionable elements, with the bounding boxes overlapping each other, which makes the screen very hard to read. And there are even harder cases: editors of every kind, text editors, rich-text editors, document editors, spreadsheets such as Excel, PowerPoint, and video editing software. There you may not even have bounding boxes at all; you must know the exact coordinates to operate such software.
To resolve this, we need to train an end-to-end VLA model. VLA (vision-language-action) comes from robotics terminology, and the GUI-operation or computer-use task is in fact very similar to a robotics VLA task interacting with the physical world. There is vision: you look at a screenshot, just as a robot takes a photo of its environment. There is language: the objective you want to achieve. And there are actions: in robotics these are joint angles, how far to move each arm or leg; in computer use they are single click at given coordinates, double click, right click, keyboard combinations, and so forth.
We have two ways to train such a model. The first option is to train the main model to directly output the mouse-click coordinates. That is, the main LLM is trained through reinforcement learning: we generate a massive number of test pages, leveraging the LLM's coding ability to generate them, and since a mouse click is very easy to verify in a simulated browser environment, it is a great setting for RL rollouts. The main LLM learns to ground elements to exact coordinates.
But for this to work, the screen resolution must be fixed. That is exactly why, when we use OpenAI's models, Gemini 2.5 Pro, or Claude 4 Sonnet, models post-trained with reinforcement learning for computer use, we must fix the resolution to one of their training values. There is another pitfall: there is no mouse movement trajectory. This is just like web crawlers: a crawler uses Playwright to locate an element and click it; the mouse does not travel there, it simply clicks that location regardless of the current mouse position. Some CAPTCHAs use exactly this behavior to detect bots, so it is very hard for such agents to get past them.
Also, the main model, whether a voice agent, a text agent, or a computer-use agent, will be a very large model, so doing reinforcement learning on it is very costly.
So we come to the second option: train a separate VLA model that works in conjunction with the main large language model and imitates human mouse-movement patterns. By imitating we mean it moves, fine-adjusts, and clicks, the way a real human moves a mouse. A real human cannot simply name the exact (x, y) coordinates of the target; so we generate trajectories from the current mouse location to the final mouse location that resemble real human trajectories. If we use real trajectories as training data, the agent will not be identified as a robot, and in particular it passes most CAPTCHAs.
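Our VLA model learns trajectories from data, but to illustrate what "human-like" means, here is a toy generator using a minimum-jerk velocity profile with a slight arc:

```python
import math

def human_like_path(start, end, steps=30, bow=0.1):
    """Toy human-like mouse path: minimum-jerk easing plus a perpendicular bow."""
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    points = []
    for i in range(steps + 1):
        t = i / steps
        s = 10 * t**3 - 15 * t**4 + 6 * t**5   # minimum-jerk: s(0)=0, s(1)=1
        arc = bow * math.sin(math.pi * s)       # bowed, not a straight line
        points.append((x0 + dx * s - dy * arc, y0 + dy * s + dx * arc))
    return points

for x, y in human_like_path((100, 100), (640, 360))[::10]:
    print(f"move to ({x:.0f}, {y:.0f})")
```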
But for this to work, we need a low-latency VLA model that can quickly adjust the mouse pointer based on the current screenshot, especially on the relative position of the pointer and the target element, choosing the next action step by step. The model must be small, and a small model suffices here because it only does the grounding task, not the reasoning task: the objective it receives is simply the output of the main LLM.
For example: click the product in the third row and second column, click the search button, or type into the search box. This is very simple. Grounding is simple; even traditional ResNet-style models produce quite good grounding results. So a small fine-tuned VLA model works very well here.
The next step is a better text-to-speech model. The previous slides covered the VLA model that operates the computer; the TTS model targets real-time voice interaction over phone calls. Many current TTS models are not natural when generating voices. The main problem is that they are too perfect: no pauses, no filler words, no repetitions. Real humans have all of these.
The main source of those pauses and filler words is that humans think slower than machines. AI thinking is very fast, around 100 tokens per second; no human thinks at 100 tokens per second, while real speech comes out at just five tokens per second. So an over-polished voice sounds like a podcast, not a real human. We therefore need the main LLM to generate cognitive pauses before the text-to-speech model renders the speech.
For example, the main LLM can generate sentence transitions and searching-for-words markers, and the TTS model generates audio from this text plus control actions: thinking, uncertain, searching, and so on. Beyond the thinking and searching markers, the filler words, the control text can also set emotion and speech rate: speak slowly or faster, happy or angry, different speaking styles. The model can even emit special sounds via special tags or tokens output by the main LLM. With these, the TTS renders much more natural speech, similar to a real human, which also makes it easier to pass robot checks, because most people do not want to speak with robots. If your voice agent sounds like a robot, the human is likely to hang up your call. This is why we want the speech to sound natural, not like a polished professional podcast.
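As an illustration, the control markup handed from the main LLM to the TTS might look like this; the tag names are invented for the sketch, and the real tag set is whatever the TTS model was trained on:

```python
utterance = (
    "<style rate='slow' emotion='friendly'>"
    "Let me just confirm <pause ms='300'/> that's the, <filler>uh</filler>, "
    "seventy-nine dollar a month plan, right?"
    "</style>"
)
print(utterance)  # rendered with a pause, a filler word, and a warm tone
```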
Here is the summary of the three-layer architecture for real-time interaction between agents and real humans. We have the perception layer, the thinking layer, and the execution layer. The perception layer takes voice and GUI changes as input and outputs a discrete event stream, including transcripts, UI change events, and acoustic events. The thinking layer takes the perception layer's output and emits interleaved thinking and action commands.
The most important point about the thinking layer is that we adopt the interactive ReAct paradigm to resolve the sequential bottleneck of the traditional observation-thinking-action loop. We allow observations and actions to be interleaved with thinking, because thinking takes most of the time and happens continuously. So we can think while we are listening, and think while we are speaking.
The whole process is fully interruptible: new events are processed immediately without destroying the ongoing thinking. It is also fully asynchronous: the model can produce multiple rounds of thinking and multiple output actions, unlike current thinking models, which think once, act once, and stop. We can think, act, think again, act again, and so on, until the model decides there is nothing more to do.
The execution layer takes the thinking layer's action commands: GUI operations, text to speak, and control commands for the text-to-speech model. It outputs continuous signals, including mouse movements; as noted, we train a VLA model so the mouse moves like a real human's, a continuous mouse trajectory rather than a bare click event. For TTS, we output audio tokens, which an audio decoder turns into a waveform via the mel spectrogram.
Looking ahead, we would like to see an end-to-end model. Today the acoustic events, text, and actions serve as the intermediate representation between layers, but there is clear duplication across the three large language models: the perception, thinking, and execution layers each contain one, with the thinking layer holding the largest. There is duplicated capability, and there is also some loss of information, because all the multimodal information is squeezed into text. So in the future we would like to consolidate the three layers' language models into a single unified one, with the perception and execution layers simplified into an audio encoder and decoder. This remains future work; we encourage the audience to explore it. And here is a future outlook for our real-time agents: three levels of AI agent interaction with the world.
We have covered real-time voice calls and graphical-user-interface operations, i.e., computer use. Voice has a data density of only around ten to 15 tokens per second; GUI operation is much harder because the interface keeps changing. For real-time gaming through GUI operations, for example, today's models cannot keep up: the graphical interface changes too fast, and neither our VLA model nor the main large language model can match that speed. But in the future, with a main model that better understands rapidly changing GUI sequences, we could in theory let agents play real-time video games. The next step, even harder, is the physical world.
There, the modalities are vision, voice, and also touch. And the output is not only mouse clicks, key presses, and voice; there are also joint action sequences, how many degrees each joint should move, or perhaps three-dimensional coordinates in space. This is much harder, and the latency requirement is strict: we need less than 100 milliseconds of latency, because otherwise, when you touch something, you have no feedback loop.
Without it, you must rely on extremely reliable, highly accurate hardware, which is complicated and expensive. But humans do not work like this. Humans do not depend on expensive, ultra-precise control, just as we do not click buttons by emitting exact (x, y) coordinates, and we do not touch objects by emitting (x, y, z) coordinates in space; we move using a very fast feedback loop. Enabling this kind of fast feedback loop for control is very important.
So this is the end of part one, real-time agents. Next, part two: agents learning from experience.
Why do agents need to learn from experience? There are many SOTA models: take one, give it some context and some tools, and let it go. Why learn from experience at all, if we already have SOTA models at hand? The reason is that a SOTA model is like a top graduate: very knowledgeable, but lacking experience.
Take our business: Pine helps busy users negotiate with customer service. We need to verify information. Before making the phone call to customer service, we need to anticipate what they will require to verify identity: the last four digits of a credit card, the full address, maybe a confirmation number, and so on.
If we do not know this the first time, we call customer service and fail. But when another user runs the same task, can we anticipate what customer service will ask and collect it from the user before actually calling? That gives a much better user experience.
Second, task procedures. Say we want to cancel a service, and customer service tells us it cannot be canceled over the phone; you must fill out a form on the website. The next time we run the same task, we do not need to get on the phone at all: we simply use the computer-use agent to fill out the form on a virtual computer. Much simpler and much more efficient.
Third, business rules. If we can anticipate which kinds of people receive certain discounts, for example veterans, or customers loyal for over two years, and estimate the price, for example a 3-gigabit-per-second broadband plan whose pricing differs by county, then we can plan a much better negotiation strategy, manage the user's expectations up front, and know which preference questions to ask the user before contacting customer service.
These business conditions are dynamic, and the underlying decisions, rules, and guidelines are not publicly disclosed. So simply improving the general capabilities of base models cannot solve this experience-based problem. This is why we call SOTA models top graduates: very knowledgeable, but lacking experience in specialized tasks.
Learning from experience is in fact a fundamental problem in machine learning, and there are three paradigms. The first paradigm is the old one: learning through gradients, including pre-training and post-training. The method is parameter update.
The second paradigm arrived with large language models: we finally found that with LLMs we often do not need to update the parameters or fine-tune for a specialized task. Instead, we can learn implicitly through the attention mechanism. This sounds academic, but it is very simple: put the prompts and a few examples into the context, and the LLM will follow your prompts, your rules, and your examples when predicting what comes next. It simply uses the long context as a temporary memory. This is the second paradigm, and almost all agents today operate in it.
And now we have the third paradigm: externalized learning. We externalize knowledge and processes into tools and into a knowledge base. The knowledge base is a textual representation of the knowledge; the tools use code to achieve self-evolution of the agent. We will walk through all three paradigms in the following slides. The core insight is this: previously we had only the first paradigm, optimizing model parameters through gradient descent. Now we have more options, in-context learning, knowledge bases, and tool generation, which require no updates to the original model parameters and are much more efficient for learning specialized knowledge in a specialized area.
Let's go into the first method, post-training: using reinforcement learning to make agents proficient at a certain kind of task. Some people think we can make large language models much smarter through RL, but this is very hard. Researchers have pointed out that reinforcement learning struggles to improve the raw intelligence of a model. The reason: reinforcement learning needs something to reinforce. It needs at least one successful output, one successful trajectory, in the rollout process; only then can you reward that trajectory. RL converts pass@k into pass@1: if I roll out 32 times and find one success, then through RL training we can reach 90% task success. Going from 3% to 90% is possible. But if the success rate is 0%, the intelligence simply is not there; the model has no chance of producing the correct behavior, so you have nothing to reinforce.
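The dependence on at least one success can be made precise. If a single rollout succeeds with probability $p$, then over $k$ independent rollouts:

$$P(\text{at least one success in } k \text{ rollouts}) = 1 - (1 - p)^k$$

With $p = 0.03$ and $k = 32$, this is about $0.62$, so most batches contain a trajectory to reinforce; with $p = 0$ it is zero for every $k$, and RL has nothing to work with.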
So the true value of RL is to internalize hundreds of complex rules, the know-how of a specialized area, for better instruction following. For example, an Anthropic write-up describes more than 50 odd rules, about coding, about not outputting irrelevant content, baked into a Claude 3 Haiku model. Because the Haiku model is very small, it cannot follow all these instructions with in-context learning alone. But if you use reinforcement learning with feedback on the rules, or simply convert the instructions into a training set and run continued pre-training plus supervised fine-tuning, the resulting model strictly follows all 50-plus arbitrary rules.
Reinforcement learning can also train tool calling: it converts pass@k into pass@1 and, more importantly, it reduces latency. For OS agents and GUI-operation agents we want low-latency operations, but if we rely on in-context learning, the model uses internal thinking, test-time scaling, to scrutinize every rule in the guidelines. It works like a checklist, spending thinking tokens to go over the rules one by one, so latency cannot be low. If you really want low latency, you must internalize these instructions and rules into the model itself.
This shortens the chain of thought and reduces latency. Tool use is likewise guided by rules: some live in the tool descriptions, explaining what each tool is for; some live in the system prompt or task prompt. For our agent operations, we use phone calls to reach the external world, the computer-use agent, web search, knowledge-base search, and many other tools. The whole process can be written as a very long procedure document, say 500 lines, which is easy for the RL process to internalize. Then we can simply remove that 500-line procedure document from the context, and both latency and cost drop substantially.
Some people assume reinforcement learning is very expensive, but it is not. We can refer to a paper on tool-integrated RL that trains a model to make tool calls, writing code to solve math problems. It uses automatic trajectory generation and simple rewards, and the training efficiency is very high: roughly $5,000 of GPU time, just several hundred GPU hours, to train the specialized RL model, whose math performance is on par with some of the strongest SOTA models. If you just want a model for your own specialized area and do not need broad generalization, this is great.
But if you want generalization across many different domains, you will need tremendous computational power and much more data. Here is some recent agent research, for example Kimi K2. It is "model as agent": it performs large-scale agent data synthesis and joint reinforcement learning training.
The most important idea here is model as agent, and it carries an equally important corollary: model as product. If you really want to do RL training, you must build a simulation environment, and that simulation environment is already nearly a product: it needs web search, computer use, and all the other tools wired together. The only missing piece is a UI; add a user interface and you have a complete product. So once model training is complete, the agent can be released directly as a product. This matters for agent developers: we cannot compete with the foundation model companies head-on. They build general agents, and the models they train convert directly into products, so a pure agent company has no advantage at the general level.
For an agent company to have an advantage, it must specialize in an area. For voice agents, we discussed many specialized models: the ASR and streaming perception models, the TTS models, the low-latency thinking-while-speaking and thinking-while-listening models, and the GUI-operation model, which resembles a VLA model doing the mouse operations. These specialized models are why we exist as an AI agent company. We also have a specialized knowledge base: domain knowledge, agentic processes, SOPs, and know-how for our specialized area.
This is why we watch the model-as-agent trend closely: we know this wave is coming.
Our agent is also model-native, as described in the earlier slides on the three-layer architecture of real-time agents. The models themselves are trained in an agentic process through a simulation environment, not just by talking to real humans: we use a large language model as a judge to evaluate the rollouts and drive the reinforcement learning.
For the critics, there is a systematic approach from Kimi K2, going from rules to self-critique; we will skip the technical details here.
and let's go into the method two uh in context learning in method one we talked about post training we talked about method to just to to do the fine tuning
of the parameters to train knowledge into the parameters but in g context learning is enabled by the large language models because it can
If we know the transformer architecture, we know the attention mechanism. Attention is similar to an informational query, similar to a key-value store, but it is not an exact key-value store. In a traditional key-value store in the systems area, the key is a string and the value is another string; we compare the query with the keys for an exact match and fetch the value. The query-key-value terminology in attention is adopted from systems, but it works much differently: it is based on soft similarity (a dot product) instead of exact match.
The keys represent the information inside all the previous tokens, including the input tokens and the tokens already output, and the query represents the information need of the current token. Attention compares the query against all these keys, weights the most similar ones highest, and then adds the corresponding values together. The values are the actual content of all the historical input and output tokens. We add them together, weighted by the resulting attention scores, and to do this we scan through all the keys.
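As a concrete picture, here is a minimal numpy sketch of this soft key-value lookup, i.e. scaled dot-product attention for one query; the shapes and names are illustrative, not any particular model's.

```python
import numpy as np

def soft_kv_lookup(q, K, V):
    """One attention step: q is the current token's query (d,);
    K and V hold one key/value row per historical token (n, d)."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # similarity of the query to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: weights over the whole history
    return weights @ V                    # weighted sum of the stored values

# Toy history of 4 tokens with dimension 8.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
q = rng.normal(size=8)
print(soft_kv_lookup(q, K, V).shape)      # (8,) -- one mixed "value" vector
```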
We can call this a brute-force approach. When we design any key-value database or relational database in the traditional systems area, we need a lot of indexes to filter out most of the keys, so we only have to look at a few keys and run faster. But transformers are different: we just leverage almost infinite computational power, although it costs a lot. It is a brute-force approach, and it matches what we see in the bitter lesson: with higher computational power comes higher intelligence. And it is also different from a vector database.
Someone may think that the transformer is similar to a vector database, but it is actually different. In a vector database the key and the value are the same thing, just the content to be searched, and the query vector has the same distribution as the keys, because both are embeddings of text.
In a transformer, however, the Q, K, and V distributions are all different, so we cannot simply reuse the sparsity assumptions of a vector database. This is a very important observation, and it is what makes sparse attention not so easy to get working.
Before going into those details, let us talk about the two bottlenecks of long context: the memory wall and the compute wall. We are very familiar with the memory wall: KV cache usage is excessive, because its size is batch size times sequence length (the number of input and output tokens) times the model's hidden dimension, accumulated across layers. There are many optimizations for the memory wall, and there are also ways to deal with the compute wall, because attention has O(n²) computational complexity. A quick back-of-the-envelope calculation below shows the scale of the memory wall.
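Here is that small sanity check. The model shape below (32 layers, 8 KV heads of dimension 128, fp16) is a made-up example, not any specific model.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per=2):
    # 2x for keys and values, stored per layer per token; fp16 = 2 bytes.
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per

# Hypothetical 32-layer model serving one 128k-token request.
gib = kv_cache_bytes(batch=1, seq_len=128_000, layers=32,
                     kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB of KV cache for a single sequence")   # ~15.6 GiB
```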
For the memory wall, there is MLA, multi-head latent attention, a fundamental method to reduce the KV cache. (I started describing a three-part scheme here: a summarization that compresses 16 tokens into one embedding vector, a sparse attention that, like a vector database, ignores the dissimilar keys, and a recent conversation window. Sorry, that is NSA, native sparse attention, which comes later; this slide is about MLA.)
MLA performs a low-rank projection of the key and value vectors into a low-dimensional latent space, which compresses them, and it uses a shared representation across the multiple attention heads, so all heads share the same low-dimensional key-value representation.
But this is not as easy as it looks. If we look at the formula of softmax attention, it seems the projection matrices can simply be merged (absorbed into the query projection), but there is one small thing: RoPE, the rotary positional encoding, which we need in order to encode positional information. With RoPE in the formula, the matrices are no longer mergeable, so the naive version does not work.
To keep the positional encoding, the MLA work does a very simple trick: it separates the positional information from the main content. The main content is passed through the low-rank space, through the compressed KV, while the positional information is processed independently through another path, and the two are combined at a later step. This way the KV cache is reduced by roughly 16 times, and inference becomes much faster.
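Here is a minimal sketch of the idea: one shared low-rank latent for the K/V content plus a small decoupled RoPE path. The dimensions, the toy rotary function, and the single-head setup are illustrative assumptions, not DeepSeek's exact implementation.

```python
import numpy as np

d, d_latent, d_rope, n = 512, 64, 32, 100        # toy sizes
rng = np.random.default_rng(0)
W_down  = rng.normal(size=(d, d_latent)) * 0.05  # compress hidden -> latent
W_uk    = rng.normal(size=(d_latent, d)) * 0.05  # latent -> key content
W_uv    = rng.normal(size=(d_latent, d)) * 0.05  # latent -> value
W_krope = rng.normal(size=(d, d_rope)) * 0.05    # separate positional path

def rope(x, pos):
    # Toy rotary encoding: rotate dimension pairs by a position-dependent angle.
    half = x.shape[-1] // 2
    theta = pos * 0.01
    a, b = x[..., :half], x[..., half:]
    return np.concatenate([a*np.cos(theta) - b*np.sin(theta),
                           a*np.sin(theta) + b*np.cos(theta)], axis=-1)

H = rng.normal(size=(n, d))                      # hidden states of n tokens
# Only these two small things are cached per token:
latent = H @ W_down                                               # (n, 64)
k_rope = np.stack([rope(h @ W_krope, i) for i, h in enumerate(H)])  # (n, 32)

# At attention time, content keys/values are re-expanded from the latent,
# and the score concatenates the content part with the positional part.
K = np.concatenate([latent @ W_uk, k_rope], axis=-1)
V = latent @ W_uv
print(latent.shape[-1] + k_rope.shape[-1], "cached dims vs", 2 * d, "for full K,V")
```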
Now for the compute wall. The idea there is to treat the KV cache like a vector database. But as we already said a few slides ago, this is very hard, because Q, K, and V do not have the same distribution. If you want to simply find the top several keys that probably have the highest attention with a query, it still requires scanning a very large proportion of the keys.
Some works do propose heuristics, for example keeping attention sinks on the first few tokens plus the recent tokens as the filter. We know the attention-sink behavior on the first few tokens emerges during training: softmax has to sum to one, and when the model does not know which token to attend to, it puts attention on the first several tokens by default, which serve as the attention sink. And for the recent tokens, we must pay attention to them to do next-token prediction; we need the context of the immediately preceding tokens. These are very natural observations. But if we keep only these tokens based on such special-case observations, some of the highest-scoring keys in the middle will be missed.
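A minimal sketch of this heuristic, the sink-plus-recent-window pattern popularized by StreamingLLM; the sizes are arbitrary, and real implementations apply the mask inside fused kernels rather than materializing it.

```python
import numpy as np

def sink_window_mask(n, n_sink=2, window=4):
    """Boolean (n, n) causal mask keeping only the first n_sink tokens
    and the last `window` tokens before each query position."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    causal = k <= q
    keep = (k < n_sink) | (q - k < window)
    return causal & keep

m = sink_window_mask(32)
print(m.sum(), "of", (32 * 33) // 2, "causal entries kept")   # 177 of 528
# Any strong key in the middle of the context is simply dropped -- the caveat.
```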
This is the caveat of this kind of sparse attention. There is another way, which I think is the more fundamental one: linear attention, which changes the computational order to achieve linear complexity.
This goes back to the origins: the transformer was proposed to overcome the limitations of RNNs, recurrent neural networks. It gained parallelizable computation, but at the cost of more computation to be done. If you rewrite standard attention, replacing the softmax with a kernel function, you can split the Q and K multiplication and let the keys and values be multiplied together first. Because K and V are multiplied together first, you avoid constructing the huge N × N matrix, where N is the number of tokens in the historical corpus, i.e. the context length, which can be very large. The advantage is extremely low computational cost; moreover, the state size, the memory, and the computation per token do not increase with the number of tokens. This is very good behavior: it means we can have an almost unlimited number of history tokens in the context without any increase in per-token cost.
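Here is a minimal numpy sketch of that reordering, using the common elu(x)+1 feature map as the kernel (an assumption; different linear-attention papers pick different maps). The point is that the KᵀV state has a fixed size, so the cost per token is constant.

```python
import numpy as np

def phi(x):                        # a positive feature map standing in for softmax
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1

rng = np.random.default_rng(0)
n, d = 1000, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Quadratic order: phi(Q) @ phi(K).T is an n x n matrix -> O(n^2 d).
A = phi(Q) @ phi(K).T
out_quadratic = (A / A.sum(axis=1, keepdims=True)) @ V

# Linear order: accumulate S = phi(K).T @ V (d x d) and z = sum phi(K) (d,).
S = phi(K).T @ V                   # fixed-size state, independent of n
z = phi(K).sum(axis=0)
out_linear = (phi(Q) @ S) / (phi(Q) @ z)[:, None]

print(np.allclose(out_quadratic, out_linear))    # True: same result, O(n d^2)
# (The causal version keeps running prefix sums of S and z instead.)
```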
But this is obviously lossy compression. Information theory tells us a fixed-size state cannot hold unlimited information, so it must lose some. And the needle-in-a-haystack retrieval capability is the foundation of in-context learning: if a model cannot do needle-in-a-haystack retrieval, it is not good at instruction following and not good at long-chain reasoning, because both are built on in-context learning. Instruction following and long-chain reasoning are in turn the foundation of the tool-use capability. So if you use only linear attention, you are unlikely to build a really good agent, because agents need frontier tool use, which depends on instruction following and long-chain reasoning.
There is also another problem with linear attention: softmax attention often performs better than linear attention on short contexts. For example, with a context length of 2,000, a very short question, and the same key-value cache size, the performance of linear attention on these short questions is not as good as softmax.
To overcome these limitations there are several approaches; we will talk about two. The first is Google's Infini-attention and the second is MiniMax's lightning attention.
Google's Infini-attention introduces compressive memory into the traditional attention mechanism. It has two kinds of attention. The first is local attention, which is just masked softmax attention over the current segment; the long-term attention is similar to linear attention. By combining the two, we keep memory efficiency, with a fixed number of memory slots being updated, yet we can support effectively infinite context.
Actually, in the implementation they are not simply using linear attention; they use a compressive storage for the long-term memory. So the usable context length is still limited in practice, but it is much longer and much more memory-efficient than plain softmax attention.
The next example is NSA, native sparse attention, from DeepSeek. (This is the scheme I mistakenly described on the MLA slide earlier.) It has three different attention mechanisms. The first is coarse-grained token compression, a summarization, similar to summarizing a long context to reduce token usage. The second mechanism is fine-grained token selection, which is similar to a vector database: it selects the most relevant snippets of the context. Note that this selects the precise original context, not a summary; that is what makes it different from the compression branch. And the third is the recent window of the conversational context, which is kept as unchanged softmax attention. So full attention over the recent window, plus sparse attention for token selection, plus the coarse-grained summarization compressing the KV cache: these three mechanisms working together can achieve significant acceleration of decoding.
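A minimal sketch of the three-branch structure, assuming mean-pooling for the compression branch and a plain top-k dot-product for the selection branch; the actual paper learns these components and gates the branches per head, so this is only the skeleton.

```python
import numpy as np

def softmax_attn(q, K, V):
    w = np.exp(K @ q / np.sqrt(q.size)); w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
n, d, block, topk, window = 512, 64, 16, 4, 32
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
q = rng.normal(size=d)

# Branch 1: coarse compression -- one pooled key/value per 16-token block.
Kc = K.reshape(-1, block, d).mean(axis=1)
Vc = V.reshape(-1, block, d).mean(axis=1)
# Branch 2: fine selection -- expand only the top-k scoring blocks, raw tokens.
best = np.argsort(Kc @ q)[-topk:]
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in best])
# Branch 3: sliding window -- full attention over the most recent tokens.
out = (softmax_attn(q, Kc, Vc)
       + softmax_attn(q, K[idx], V[idx])
       + softmax_attn(q, K[-window:], V[-window:])) / 3   # NSA uses learned gates
print(out.shape)
```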
And now we come to the lightning attention of MiniMax. MiniMax takes a drastically different route: it proposes a hybrid that combines linear attention with softmax attention. In theory we could use only linear attention, but as we discussed, linear attention has drawbacks: it is not very good on short contexts, and on long contexts the in-context learning and needle-in-a-haystack capabilities are not very good. So MiniMax inserts one softmax attention block after every seven lightning (linear) attention blocks. With 80 layers, every seven lightning attention layers are followed by one softmax attention layer, so 10 layers are softmax attention. This strikes a balance between long- and short-range dependencies.
In this way, MiniMax supports more than one million tokens during training and more than four million tokens during inference, and it is also an open-source model.
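The interleaving itself is trivial to express; a sketch of the 7:1 pattern, with the block names as stand-ins:

```python
# Hybrid stack: one softmax (full) attention layer after every 7 linear ones.
N_LAYERS, RATIO = 80, 8   # every 8th layer is softmax -> 10 softmax layers

def make_layer(i: int) -> str:
    return "softmax_attention" if (i + 1) % RATIO == 0 else "lightning_attention"

stack = [make_layer(i) for i in range(N_LAYERS)]
print(stack[:8])                            # 7 x lightning, then 1 x softmax
print(stack.count("softmax_attention"))     # 10
```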
Now let me give some insights about learning. We have talked about parameter updates through reinforcement learning and post-training, and we have talked about in-context learning. Many people wonder: what is the difference between in-context learning and parameter fine-tuning? Several recent theoretical advances have pointed this out; in particular, the paper "Learning without Training" argues that transformer blocks implicitly update the MLP weights through the context. That is, in in-context learning the attention mechanism is effectively performing rank-one updates to the MLP part of the original parameters.
Each time we output one token, the attention mechanism produces the equivalent of a rank-one update per attention head. So if we have 32 attention heads, we get a low-rank, rank-32 update to the MLP.
This is why language models can learn new patterns from the prompt, from few-shot examples, and from agentic tool-call trajectories, without any training.
This implicit weight-update mechanism provides a theoretical basis for designing better prompts.
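To make the flavor of the claim concrete, here is a tiny numpy identity: if attention shifts the MLP's input by a context-dependent vector Δh, the output equals leaving the input alone and patching the weight matrix with a rank-one update. This is a simplified illustration of the paper's result, not its full derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))      # an MLP weight matrix
h = rng.normal(size=8)            # the token's hidden state
dh = rng.normal(size=8)           # what attention adds from the context

# Output with the context contribution added to the input...
with_context = W @ (h + dh)
# ...equals a rank-1-patched W applied to the bare input.
W_patched = W + np.outer(W @ dh, h) / (h @ h)
print(np.allclose(with_context, W_patched @ h))   # True
```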
Here are some counterintuitive findings from another paper, "Deeper Insights Without Updates": for tasks containing implicit patterns, in-context learning can capture and utilize those patterns better than fine-tuning. This is quite counterintuitive, because most people think fine-tuning lets the model internalize the knowledge, while in-context learning, relying on attention, is shallow and sparse. But as we said on the previous slide, the attention mechanism is not merely a sparse lookup; it is equivalent to a low-rank update to the original MLP parameters.
One piece of evidence uses circuit-shift theory, proposed in that paper, to study why in-context learning is better at pattern recognition. A circuit is a subgraph in the model responsible for a specific behavior, composed of specific attention heads and MLP layers, representing the model's thinking path for solving a problem. In-context learning leads to larger-scale circuit shifts: after the prompt, the few-shot examples, or the agent's working trajectory, the circuits responsible for the specific behavior change more than they do when similar data is used to fine-tune the model.
Another example is pattern capturing: by manipulating the activation values of specific components, we can ask whether the model quickly captures the patterns inside the examples. Pattern capturing needs attention over the few-shot examples in the prompt. And we find that in-context learning is not simple pattern matching; it actually activates several different computational circuits in the model.
Even with a thousand times more training examples, the improvement from fine-tuning is very limited. This explains why in-context learning can achieve better results even without any parameter updates.
So the sample efficiency of in-context learning is large, and this has an implication for our agents: before trying reinforcement learning or supervised fine-tuning on an agent, we should first ask whether we can simply use in-context learning, prompts, and few-shot examples. If that is possible, we can often achieve better performance, with better accuracy and higher efficiency.
Okay. That is the end of the in-context learning part; next we go to our third method, externalized learning.
We treat externalized learning as a third paradigm, different from the first (parameter updates via gradient descent) and the second (in-context learning), because it is external to the model itself. It may seem like just a hack, but it is not just a hack; let me explain.
This is the most important part of the talk. The first mechanism is the knowledge base, and the second mechanism is tool generation.
Let me first talk about the external knowledge base: rapidly accumulating experience from past trajectories and applying that experience to new trajectories, to new executions.
As with any knowledge base, there are two parts: knowledge representation and knowledge retrieval. For knowledge representation there are several ways. The first is summarization: we store a summary, for example "to cancel service B from company A, you need to verify the user's identity by providing the order number, registered email, and last four digits of the credit card number." That is conclusive, summarized information. Another way is the raw log, for example the complete transcript of contacting customer service; for raw details you need descriptive file names and storage in the file system. Summary and raw details are the two ends of a spectrum, and there are middle points: tree-based summarization. RAPTOR is great work here: it treats all the raw details and raw trajectories as the leaf nodes of a tree, performs clustering to generate summaries, then clusters the summaries to generate the next level up, and so on to the root.
For retrieval over such a tree, there are two ways. The first is a vector database doing similarity matching over the leaf nodes and intermediate nodes; the second is to index it like a file system, navigating via the descriptive file names in an agentic way. A sketch of the tree-building step follows.
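Here is a minimal sketch of RAPTOR-style tree building, with stand-in `embed` and `summarize` functions and hard k-means clustering for simplicity (RAPTOR itself uses soft clustering with Gaussian mixtures and real embedding/LLM calls).

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):                 # stand-in embedding model
    rng = np.random.default_rng(len(texts))
    return rng.normal(size=(len(texts), 64))

def summarize(texts):             # stand-in for an LLM summarization call
    return "summary of: " + " | ".join(t[:20] for t in texts)

def build_tree(leaves, branching=4):
    """Cluster nodes and summarize each cluster, level by level, up to a root."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        nodes = levels[-1]
        k = max(1, len(nodes) // branching)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embed(nodes))
        parents = [summarize([n for n, l in zip(nodes, labels) if l == c])
                   for c in range(k)]
        levels.append(parents)
    return levels                  # retrieval can match against every level

tree = build_tree([f"raw trajectory {i}" for i in range(16)])
print([len(level) for level in tree])    # e.g. [16, 4, 1]
```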
A final way is agentic autonomous summarization. The reason for it is that the summaries, raw details, and tree-based summaries above are not agentic: they happen at fixed points of the agent's work. In this last way, we let the agent autonomously decide when to write summaries into the file system and what to write, and into which files. A lot of agents already use this; for example, ChatGPT and Cursor use autonomous summarization, letting the agent decide on its own.
For the retrieval part, we have three ways. The first is automatic matching and context insertion.
This is suitable for fixed kinds of information, keyed, for example, on the user ID: the profile of each user, the contact information of each user; or, for each business, its contact number and working hours. This fixed information fits a rigid schema, so we can do the automatic matching and insert it into the context; there is no need for the LLM to decide anything autonomously.
The second way is agentic semantic search, which means the agent can autonomously decide which keywords to use for searching. For example, it can use vector databases or BM25 to match against the knowledge-base keywords.
The third way is agentic file-system reading: let the agent use tools such as read-file, find, grep, etc. to read content from the file system.
All three ways are important, and I want to highlight that the agentic ways matter: semantic search and file-system reading are the two that the agent decides to use autonomously. For example, if it cannot find any matches through semantic search, it should fall back to file-system search and read the files itself. It is very important to give redundant, overlapping tools so the agent can leverage all kinds of search capabilities, similar to how a human uses a knowledge base, a database, or a computer file system.
Now a small tip, based on Anthropic's contextual retrieval. It is an important tip for improving the retrieval accuracy of RAG. The core problem is that traditional RAG loses context when chunking documents. For example, a chunk says "the company's revenue grew by 3% over the previous quarter"; without the surrounding context of the document, retrieval accuracy for this chunk will not be high. But we can use summarization to add a short explanatory context to the chunk, so we get much better matching when we do the retrieval.
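A minimal sketch of the idea: before embedding each chunk, ask an LLM to prepend a short chunk-situating context. `llm` and `embed_and_index` are stand-ins for whatever model and vector store you use, and the prompt is a paraphrase of the published recipe, not an exact quote.

```python
CONTEXT_PROMPT = """<document>{doc}</document>
Here is a chunk from the document:
<chunk>{chunk}</chunk>
Give a short context situating this chunk within the whole document,
to improve search retrieval of the chunk. Answer with the context only."""

def contextualize(doc: str, chunks: list[str], llm) -> list[str]:
    """Prepend an LLM-written situating context to every chunk before indexing."""
    out = []
    for chunk in chunks:
        ctx = llm(CONTEXT_PROMPT.format(doc=doc, chunk=chunk))
        out.append(f"{ctx}\n{chunk}")   # embed/BM25-index this; return the original
    return out

# Usage sketch: embed_and_index(contextualize(doc, chunks, llm))
# e.g. "This chunk is from ACME's Q2 filing; Q1 revenue was $314M."
#      + "The company's revenue grew by 3% over the previous quarter."
```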
Now we come to a comparison of learning methods: the external knowledge base versus built-in attention. Built-in attention means long context, the in-context learning we just discussed. A rough analogy: using an external knowledge base is like getting a language model that does not natively think to reason step by step through a chain-of-thought prompt, while built-in attention is like a model that natively supports long context. Why this analogy? Because an external knowledge base is built on vector databases and file-system indexes, while built-in attention has query-key-value matching inside it, queries matched against keys, so it is a built-in kind of vector database, a built-in knowledge-retrieval process. In theory, the end-to-end optimization of long-context in-context learning has the higher potential effectiveness.
But RAG has real pros. It is not at all bad, because we can spend additional computation on summarization, and the summarization is very important. We don't simply push the original transcripts or agent execution trajectories into the vector database: we have the summarizations discussed before, including agent-autonomous summarization, and we have contextualized embeddings and contextualized indexes. This transformation using additional computational power matters. Also, during summarization we control the prompts, and the prompts can flexibly incorporate the industry know-how of the specialized area. So RAG may sometimes perform better than long context.
Now let me also compare RAG with fine-tuning.
This is based on the paper "Fine-Tuning or Retrieval?". Its core insight is that retrieval-augmented generation is not only more effective but also avoids the knowledge-forgetting problems that fine-tuning can cause.
RAG is very good at retrieving accurate factual knowledge, while fine-tuning is not very effective at memorizing accurate factual knowledge. RAG is also very good at handling new domains, because it is a general technique, while fine-tuning requires data augmentation: you must prepare multiple phrasings of the same knowledge and retrain, otherwise the model cannot remember it.
So in terms of generalization and learning efficiency, RAG excels. But sometimes fine-tuning works better: some tasks involve things that are hard to represent in text, like how to ride a bicycle, how to speak like a human, how to speak more naturally, how to speak with high EQ. These things are very hard to put in a text document, so they are better suited to fine-tuning or reinforcement learning, the first paradigm of learning, which updates the model parameters through gradients.
RAG may also enable large-scale knowledge summarization through large language models. We used to have traditional knowledge bases and expert systems, but those expert systems were not very scalable, because the knowledge was fragmented and querying was inefficient. With large language models, you have a very capable worker that can read long context and extract the knowledge from it, because humans are not good at reading long context. If you want to read a novel of 100,000 tokens, it takes you a very long time; but if you feed those 100,000 tokens into a SOTA reasoning model, it can answer any question about the whole book in about a minute.
So large language models are actually more capable of extracting the knowledge of a specialized domain into a compact representation.
Next, tool generation is another way to do externalized knowledge injection. The first way we talked about was the knowledge base; the second is to enable the agent to write code and generate tools, to achieve self-evolution.
Here we have an idea from the Alita paper: "minimal predefinition, maximal self-evolution". On the GAIA benchmark of general agents, it outperforms many complexly designed agent systems and ranked number one on the benchmark's validation set.
The core idea is very simple: search GitHub or other code repositories for open-source tools, download them and try to run them; and also use the code-generation capability to write tools itself. So we get tools from open source, plus tools written by the agent itself.
But using these self-generated tools is hard, because we quickly accumulate a large number of tools. For example, we may have generated 10,000 tools, or we may have connected to an MCP server (MCP means Model Context Protocol) with a lot of tools.
We cannot put all of that into the context. For example, the complete tool set of the GitHub MCP would require more than 200,000 tokens; the context explodes.
So if we let the agent do passive selection over all these tools, it will be very ineffective.
There is a paper called MCP-Zero, which performs active tool discovery for autonomous large-language-model agents. It converts the passive tool directory into active lookup: we let the agent actively identify its capability gaps and request tools on demand. We do not put all the tool descriptions and parameter descriptions into the model's context. Instead, the agent first selects the server, which scopes the platform or domain, then loads the tool list, then selects the tool. The tools can have a hierarchical organization or a similarity-search-based organization, so the model can search the tools, load the chosen ones, and finally execute them.
That is how we handle tool scaling.
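A minimal sketch of this hierarchical, on-demand discovery; the registry layout and the `embed` ranking are illustrative stand-ins, not the paper's exact protocol.

```python
import numpy as np

def embed(text: str) -> np.ndarray:             # stand-in embedding model
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=32)

REGISTRY = {   # server -> {tool name: one-line description}; only this is indexed
    "github":   {"create_issue": "open a new issue", "search_code": "search repos"},
    "calendar": {"add_event": "create a calendar event"},
}

def discover(capability_gap: str, top_k: int = 1):
    """Rank servers by description similarity, then tools within the best server.
    Full parameter schemas are loaded only for the finally selected tool."""
    def rank(items):
        q = embed(capability_gap)
        return sorted(items, key=lambda kv: -float(embed(kv[1]) @ q))[:top_k]

    server_blurbs = {s: " ".join(t.values()) for s, t in REGISTRY.items()}
    (server, _), = rank(server_blurbs.items())
    (tool, desc), = rank(REGISTRY[server].items())
    return server, tool, desc      # next: load the schema, then call the tool

print(discover("file a bug report on our repository"))
```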
In the next part, let me give three use cases of agent self-evolution through tool generation. The first case is generating intelligent RPA for computer use. We know computer use suffers from low speed and high cost; we talked about that in the first part of the talk. Traditional RPA also has strict limitations: it is a rigid script, it cannot adapt to dynamic interfaces or detect changes to websites, and most importantly it requires a human to write it, which is very cost-ineffective.
To achieve this, we do element localization to generate RPA code. Instead of using the language model to analyze a screenshot and produce an action each time, we let the LLM generate a trajectory that records the element clicked in each step.
To handle dynamic content, we distinguish between fixed processes and dynamic judgments, and we only extract the operation sequences that are automatable.
In addition, we have several engineering tricks. For example, to deal with uncertain response times, we use event-driven execution: we listen for the event-completion states rather than sleeping for a fixed time.
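A sketch of one such event-driven step with a fallback, assuming Playwright on the browser side; the `vision_agent_fallback` hook is a hypothetical stand-in for handing the step back to the vision model.

```python
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

def run_step(page, step, vision_agent_fallback):
    """One generated RPA step: wait for the element event instead of sleeping;
    if the page changed and the selector never appears, fall back to the VLM."""
    try:
        page.wait_for_selector(step["selector"],
                               timeout=step.get("timeout_ms", 8000))
        page.click(step["selector"])
    except PWTimeout:
        # Website layout probably changed -> let the vision model drive this
        # step and flag the RPA script for regeneration.
        vision_agent_fallback(page, step)

# Usage sketch:
# with sync_playwright() as p:
#     page = p.chromium.launch().new_page()
#     page.goto("https://example.com")
#     run_step(page, {"selector": "#submit"}, vision_agent_fallback=...)
```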
In this way we can generate very effective RPA programs, and the RPA program for each website effectively notices when the website changes: when the expected response does not arrive within a reasonable timeout, it falls back to the language model. During the RPA process it can also invoke the language model to look at the current step: some steps are pure RPA steps, fully automated, and the remaining steps are still handled by the vision language model to cope with dynamic content.
For example, checking the weather: a traditional, fully agentic screenshot-driven process needs nine LLM calls and 47 seconds; after acceleration, it takes only 10 seconds. Booking a flight on the official website is also about five times faster. You can see there are four RPA segments here: RPA accelerates the fixed parts, such as filling the forms, while navigating the website and understanding the flights to select the best candidate is still handled by the main language model. So we accelerate the critical path and the repeated steps while keeping the flexibility to handle dynamic content.
Now we come to the second use case: visualizing and parsing agentic logs. We have execution trajectories for these agents, which include the tool calls and tool responses, plus the communication with users as user/assistant messages. All of this needs to be visualized as a trajectory, as a conversation history, for better understanding and debugging.
But we have many tools. A SOTA agent may have tens or even hundreds of tools, and they often have sub-agents that invoke other agents. For example, our orchestration agent invokes a deep-research agent, a phone-call agent, and a computer-use agent. These sub-agents are very diverse: they have very different parameter formats, the tool-response formats often differ too, and because many developers are building new code, the formats are constantly changing.
So if you have, say, a web front-end engineer hand-writing parsers for all these logs and doing the visualization, it is very fragile and very hard to keep pace with the development of the agents and tools.
So we take the self-evolution approach. We let the front end automatically report parsing failures to the agent, and on a parsing failure we generate new parsing code based on the failing example. Then we need to verify that code. Because it concerns the user interface, the GUI, we use a virtual browser to check that the new parsing code parses correctly, and, more importantly, we use a vision language model to check whether the visualization matches our expectations given the original logs. The vision language models are very effective here: they take the original log, your code, and a screenshot of the rendered visualization, and determine whether the implementation is correct. If it is, the parsing code is automatically updated.
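A sketch of that repair loop; every helper here (`generate_parser`, `render_in_browser`, `vlm_check`) is a hypothetical stand-in for the LLM call, the headless-browser render, and the vision-model judgment.

```python
def repair_parser(failing_log: str, old_parser_src: str,
                  generate_parser, render_in_browser, vlm_check,
                  max_rounds: int = 3):
    """Regenerate the log parser until the rendered result passes a VLM review."""
    parser_src = old_parser_src
    for _ in range(max_rounds):
        parser_src = generate_parser(failing_log, parser_src)  # LLM writes code
        try:
            screenshot = render_in_browser(parser_src, failing_log)
        except Exception as err:              # code crashed -> feed error back
            failing_log += f"\n[render error: {err}]"
            continue
        if vlm_check(original_log=failing_log, screenshot=screenshot):
            return parser_src                 # verified: deploy automatically
    return None                               # give up -> page a human
```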
Then we come to the third use case, also a kind of agent self-evolution: automatic diagnosis of production-system issues. We run a production agent system with hundreds and thousands of cases online, and manually inspecting the problems inside them is simply not possible.
So we use the execution trajectories we already have, and we use the language model, taking the system architecture documents and product requirement documents as input, to automatically analyze the execution flow and then generate a regression test case for each identified problem. For the important items, it can read the Scrum dashboard, create a work item, create the test cases, and create the session, meaning the agent's working session, for manual inspection by the developer.
In theory we can go one step further: from the test case, the work item, and the original problem report, we can create a pull request, and even review the pull request and run the test cases fully automatically. We haven't done that yet; we still rely on humans using Cursor, Claude Code, and similar agentic coding tools to work on these attempts. But I think one day it will come true: you let the agent do the self-evolution, localize the problems from its own execution logs, generate the test cases and work items, pin down the problem details, then generate code accordingly and run the regression tests. If the regression tests pass, the agent's code modification goes online automatically.
If we can make this come true, the agent can really self-evolve, and humans will only need to instruct the process and do the necessary reviews of the pull requests, to maintain code quality and design.
Finally we come to the summary: from post-training to in-context learning to externalized learning.
In this second part, about agents learning from experience, we covered the first paradigm, post-training, which performs parameter updates through reinforcement learning or supervised fine-tuning.
The second paradigm is in-context learning, which in theory is a soft update: the attention mechanism at inference time is equivalent to a low-rank update to the parameters. In-context learning has the pros of fast adaptation and no training, but it is not good at representing knowledge that is hard to express in text.
The third paradigm is externalized learning, with the knowledge base and tool generation. For the knowledge base we have text-based representations: summaries, raw context, tree-based summaries, and agent-autonomous summaries. For tool generation, we codify the processes into tools to achieve efficient and reliable automation.
Many people will ask: why do the three paradigms coexist? Because they suit different purposes.
The first, post-training, is better at representing information that is hard to put into text. For example, how to speak like a human, as we saw in the first, real-time part of the talk: the VLA model for GUI operation, the TTS and ASR models, the language model combining fast thinking and slow thinking, doing thinking-while-listening and thinking-while-speaking. These things are better internalized into the model parameters through post-training. This also guarantees latency: you get very low latency without walking through a checklist in the prompt item by item.
In-context learning adapts very fast and is very good for factual data, because factual data is more accurate when kept in context than when trained into the parameters.
And externalized learning serves long-term memory, in contrast to in-context learning, through the knowledge base; and you can automate processes via code through tool generation. The LLM then does not need to know the details of each tool; it only needs to know what the tool does, its description. It is a level of abstraction: when you call the tool, or once you call the sub-agent, the sub-agent takes care of all the details, and the main agent reduces its mental burden through this extra layer of abstraction, the tool call.
Externalized learning is also a way to move beyond the limitations of attention. We have parameterized learning by training, and in-context learning, which we have discussed; externalized learning addresses major problems of transformers.
The first is hallucination and reliability: an external knowledge base provides a verifiable, precise source of truth.
It also tackles the inefficient-learning problem. If we train knowledge into the many parameters, the parameters need a lot of time to train, and at inference time we have to go through all of them; even with mixture-of-experts, we go through a large portion of the experts. That is still a very inefficient way to do retrieval. But with a knowledge base or a tool set, we have the technologies we discussed for navigating large tool sets and large knowledge bases very efficiently.
In the last two slides I want to talk a little about the scaling law. The first slide is about going from pre-training to reinforcement learning. As we know, the bitter lesson from Richard Sutton says, in one line, that we want AI agents that can discover like we can, not ones that merely contain what we have discovered; building in our discoveries only makes it harder to see how the discovering process can be done. Pre-training is based on next-token prediction, and in phase two we have reinforcement learning, which is training through interaction with the world.
So phase one gives us AI agents that contain what we have discovered, and in phase two reinforcement learning gives us agents that can discover like we can. Phase-two agents move the model beyond passive learning to actively exploring the world: interacting with humans, with the internet and websites and computers through graphical user interfaces, and, in the next step, with the physical world, for example controlling robots and autonomous vehicles. For interacting with humans, the first step is real-time voice. This learning method lets us build a second curve of the scaling law, learning from the successes and failures of interactions.
However, both phases, pre-training and reinforcement learning, run into a limitation of transformers, a bottleneck of preciseness: transformers suffer from hallucinations and information confusion. If you want to precisely memorize dynamic, specific details, for example about each person, each business, or each region, it is very hard.
So the future path may be externalized learning, which may continue the scaling law. In the bitter lesson article, Rich Sutton has one sentence: there are two methods that seem to scale arbitrarily, search and learning.
Externalized learning is exactly search and learning. The learning part is to learn and summarize the interaction experiences into knowledge and code: code is tools, and knowledge is text represented in the knowledge base through summarization.
And search corresponds to searching the external knowledge bases and the tool libraries, whether through agentic keyword search, through file-system navigation, or through vector search. All of these are highly scalable, because we have 20 years of experience building highly efficient, highly scalable search systems.
In this way we can have a precise and structured knowledge base, using the SOTA large language models to summarize all these experiences into structured knowledge.
This is very important because, before large language models, this kind of high-quality summarization required expensive domain experts. There were a lot of expert systems decades ago, based on human-written rules, and they did not scale, simply because there were not that many domain experts. But today, with large language models, there is effectively an infinite supply of domain experts.
Actually, if you look at the NLP field, the only traditional NLP topic that still survives in the large-language-model era is summarization. Named-entity extraction, for example, is already obsolete because we do not need it anymore; but summarization is still alive, and it may stay alive for several more decades.
Apart from the knowledge base, we also have code as a universal knowledge representation. For example, we can do summarization into a code representation to ensure structure and high quality, and we can generate tools to represent the procedural knowledge in our experience.
Code is not just a tool for programmers; it is a very important, universal capability of SOTA language models. It is a universal structured data representation: precise, verifiable, and composable. That is why it is so important to use code as a knowledge representation.
So in a future system we expect to see the co-evolution of large language models with externalized learning, including the knowledge base and tools: the external knowledge base manages the precise and dynamic knowledge, while the model itself provides the general reasoning and understanding capabilities, plus perhaps some domain know-how, rules that do need to be internalized into the model parameters.
Finally, let me talk about Pine AI.
Pine AI is our company, and we are looking for full-stack engineers who can build SOTA AI agents.
Our philosophy is that everyone's contribution to the company's valuation should be in the tens of millions of dollars. You can see this at the best AI companies, including OpenAI, Anthropic, and Google, and also at good agent companies like Cursor: each carries a valuation of more than ten million dollars per person. We believe that in the era of AI-assisted programming, everyone can instruct the AI much like instructing ordinary engineers, so we no longer need engineers whose capability is below the SOTA language models.
Actually, more than 80% of the code in our system is already written through coding agents, for example Cursor, Claude Code, and so on. And we expect everyone to love solving problems hands-on. There is a famous quote by Linus: "Talk is cheap, show me the code." Some people doing vibe coding say the quote should be flipped: "Code is cheap, show me the talk." But no matter which you use, talk or code, you need to solve problems hands-on, because everyone will be a combination of architect and product manager.
You need to do the architect's work: design the technical architecture of the system, instruct the coding agent to finish the job, and review the code the coding agent produces. You also need to understand the product: to know what to build, what is good, and what is bad. Evaluation is very important here. Everyone should have a good sense of what is good and what is bad; traditionally that was only the product manager's job, but now every developer needs this capability.
We also need solid software engineering skills and an understanding of first principles, so we know what kinds of things are best delegated to agents and what kind of agent architecture is best. I think using coding agents fluently is very important, because coding agents are among the most successful agents this year. If you are good at using coding agents, you know the capability boundaries of the agents and the language models: you know what they can and cannot do, so you can make the best use of their capabilities. You also know how to write the prompts, how to write the cursor rules, how to organize and manage the context, when to summarize, when to start a new context, and so forth. This is very important.
Our mission is to truly solve users' troubles and hassles by building the leading voice agents and interactive agents: agents that interact with the world in real time and really learn from experience in an effective, low-cost way.
Finally we come to the last slide. There are two clouds over our agents: the first cloud is real-time interaction, and the second cloud is learning from experience.
I want to end this talk with the famous quote from the bitter lesson: "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
For the real-time interaction part, we propose a new architecture that better leverages computation to do continuous thinking while listening and speaking, with a perception layer and an execution layer that work in conjunction with the continuously thinking main language model to produce a fully interactive, real-time streaming experience.
And for the learning-from-experience part, we propose three paradigms. The first paradigm is post-training of the models, by parameter update. The second is in-context learning, through implicit parameter updates. And the third is externalized learning, using an external knowledge base and written tools to let the agent evolve itself. In learning from experience, and especially in externalized learning, the knowledge base and the tool set are general methods that can leverage a lot of computation, through summarization and code generation, which is ultimately the most effective way of keeping the scaling law going into the next decade. Thank you so much for listening.