Two Clouds Over Agents: Real-time Interaction with the Environment, and Learning from Experience
By Bojie Li
Summary
Topics Covered
- Two Clouds Plague AI Agents
- Interactive ReAct Enables Thinking While Listening
- Fast and Slow Thinking Mimics Humans
- Three Paradigms Advance Agent Learning
- Externalized Learning Scales Beyond Transformers
Full Transcript
Hello everyone. I'm Bojie Li, the co-founder and chief scientist of Pine AI. Today I'm honored to give a talk about the two clouds over agents: real-time interaction with the environment, and learning from experience.
We know the famous remark attributed to Lord Kelvin: the knowledge of physics was almost complete, and only two small clouds remained on the horizon.
In the world of agents, we also have two clouds. The first cloud is real-time interaction, and the second cloud is learning from experience.
For the first cloud, real-time interaction: voice interaction today suffers from high latency, and the conversations often feel unnatural.
The agent sounds like a robot, not like a real human. The second cloud is learning from experience. For example, if you build an agent with Claude or Gemini, it starts each task from scratch; it has no ability to learn from previous tasks, regardless of whether those tasks succeeded or failed.
So we propose several paradigms for agents to learn from experience, and we also propose a new architecture for a streaming, event-driven agent loop that interleaves observation, thinking, and action. Let's go into the details.
Before we do, note that the two-clouds problem was not originally pointed out by me; it was pointed out by the OpenAI research scientist Shunyu Yao. He identifies two kinds of problems. The first is that evaluations are expected to run automatically.
That means the evaluation never engages a human or a real environment; it only performs tool calls in a simulated environment, with no human involved. The second problem is exactly our second cloud: agents have no mechanism to learn from experience. For example, he gives an example: if you have a test set of 500 tasks, you just execute the tasks one by one, independently.
But this does not resemble a human. A human software engineer solves issues faster and faster as they gain more understanding of the repository and of the coding guidelines inside the organization.
These are the hard problems that hinder the deployment of agents in the real world. So let's tackle them. Part one is the agent's real-time interaction with the environment. The environment means humans, the internet, or the physical world: the agent interacts with humans through real-time voice, with the internet through computer use (web browsers and GUIs), and with the physical world through robots; that last one is future work. The simplest setting in which agents interact with the environment is the voice agent: real-time interaction with real humans.
We all know ReAct, reasoning and acting: it is a loop of observation, reasoning, and action. But the latency of this loop is very high. Large language models need chain-of-thought, and for some reasoning models the thinking takes more than 10 seconds. And even if we only optimize the LLM part, it is not sufficient, because we also have the perception stage.
In this stage, we need to detect the user's voice activity; we need to know when the user finishes speaking. The reason is that most speech recognition models simply take a piece of audio as input and output text; they are not streaming. So you must wait for the speech-end event from the VAD (voice activity detection) model, then feed the audio into the ASR model, then feed the complete sentence into the large language model, which performs the thought process, then split the LLM's response into sentences, and feed them into a text-to-speech model to produce the voice response. The cumulative latency of all these stages far exceeds human tolerance; it is often as long as 10 seconds or even more.
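To see how the stages add up, here is a back-of-the-envelope sketch; the individual numbers are illustrative assumptions, not measurements from the talk:

```python
# Illustrative per-turn latency budget for the cascaded pipeline (assumed values).
pipeline_ms = {
    "vad_endpoint_wait": 500,   # wait to confirm the user really stopped speaking
    "asr_final_pass": 200,      # recognize the completed utterance
    "llm_thinking": 8000,       # chain-of-thought for a reasoning model
    "tts_first_audio": 300,     # time to first synthesized audio chunk
}
total = sum(pipeline_ms.values())
print(f"time to first response: {total / 1000:.1f} s")  # ~9 s, easily 10+ s in practice
```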
But even if we optimize the large language model part down to a latency of 1 second by disabling thinking entirely, there is still a problem. For example, if someone offers you a plan and says, "The plan is very good, would you like to accept it?", the AI will answer, "It's very good, let's accept it," agreeing to a possibly bad plan without any thinking. So how can we balance the dilemma of fast and slow thinking?
This is the hard problem we want to tackle. Our architecture is composed of three layers: the first is the perception layer, the second is the thinking layer, and the third is the execution layer.
In the first layer, we convert continuous real-world signals, for example voice and video, into a stream of discrete events. These events are easier for the thinking layer's large language model to process: they are just tokens. The thinking layer processes these events asynchronously.
Most importantly, this enables thinking while listening and speaking. For example, when you are thinking and new events arrive, you do not interrupt your thinking; you simply accept the new events and insert them into the thinking process. This produces an interleaved sequence of input events (events received from the real world, such as voice), internal thoughts, and actions. We will go into more detail in the next slides.
Third, the execution layer converts the discrete action commands back into real-world signals: for example voice through a text-to-speech model, or mouse movements through a VLA model that operates the computer.
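To make this three-layer flow concrete, here is a minimal sketch, not Pine AI's actual implementation, of a streaming event-driven loop in which perception, thinking, and execution run concurrently over event queues; all names are illustrative:

```python
import asyncio

async def perception(events: asyncio.Queue):
    # Stand-in for converting continuous audio into discrete events.
    for ev in ("speak_start", "transcript: hello", "speak_end"):
        await asyncio.sleep(0.3)        # placeholder for real audio timing
        await events.put(ev)
    await events.put(None)              # end of stream

async def thinking(events: asyncio.Queue, actions: asyncio.Queue):
    while (ev := await events.get()) is not None:
        # New events are folded into the ongoing thought instead of
        # restarting it; here we just emit a quick action per event.
        await actions.put(f"ack({ev})")
    await actions.put(None)

async def execution(actions: asyncio.Queue):
    while (act := await actions.get()) is not None:
        print("execute:", act)          # TTS / mouse control would go here

async def main():
    events, actions = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(perception(events), thinking(events, actions),
                         execution(actions))

asyncio.run(main())
```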
So first, the perception layer, which converts continuous real-world signals into discrete events.
Let us start with the limitations of VAD and ASR. VAD is voice activity detection, ASR is automatic speech recognition, and together they have a significant latency accumulation problem. First, there is VAD detection latency: to decide that the user has stopped speaking, we must wait 500 milliseconds or more to make sure the user is not going to continue. So after the user's last word, that is already 500 milliseconds; then the audio goes into the ASR model, which takes, say, another 100 to 200 milliseconds. The latency accumulates.
Another problem is information loss. The VAD model only outputs a binary signal, and the ASR model only outputs text without any acoustic details. For example, whether the speaker is in a bad mood or a good mood, ASR cannot tell. Whether there is background music or background noise, or whether the user makes a simple backchannel sound like "aha," it cannot capture that. There is information loss.
And we also have error propagation. VAD models are quite small models that run on the CPU in about 1 millisecond, so their accuracy is not very high. If you tune the threshold very high, they miss short utterances such as "hello" or "aha." If you tune the threshold very low, they include non-speech sounds: knocking on the table, for example, would be treated as speech.
This is not good. The last but also very important limitation is the lack of context in speech recognition models, because they recognize speech in segments: the VAD model has cut a complete paragraph into short utterances, and the ASR model has no context to understand these partial sentences, especially when recognizing addresses, brand names, personal names, and domain-specific terms. A simple example: suppose a person is called Bojie, which is very hard to pronounce in English. If the earlier context says "the user's name is Bojie" and "the email is b-o-j-i-e-l-i at gmail.com," everyone here can understand later mentions, because we know Bojie is the speaker's name and it is already written down. But the ASR model does not know that, so its error rate on this kind of recognition is very high. These are the problems of the existing approaches.
To resolve these problems, we need a new streaming speech perception model based on autoregressive large language models. Large language models are very good at handling contextual information, for example personal information, domain-specific terms, brand names, account numbers, and prices; an LLM understands these far better than a small standalone model such as Whisper or SenseVoice. And why do we call it a streaming approach? Because we do not want to cut the input audio into small segments and feed them to a standalone model. We build a multimodal model in which incoming speech tokens arrive in a streaming manner, and the text and other acoustic events are emitted in a streaming manner. It does not only output text tokens; it also outputs special tokens that represent acoustic events. For example, when the user starts or finishes speaking, it issues speak-start and speak-end events.
These events look similar to what a VAD model outputs, but they are much better: a VAD model is a tiny model that merely decides whether a sound looks like human speech, while these speak-start and speak-end events are produced by a large language model that understands the meaning of the user's utterances, so their accuracy is much higher. The model can also detect interruption intent: if we tell it that the AI is currently speaking, it produces an interrupt token when it determines the user wants to interrupt. But if the user does not intend to interrupt, for example just says "aha," it recognizes that there is no interruption intent.
In addition, there are emotions and other kinds of information, for example laughter, music, environmental sounds, and background noise. We capture all of this: not only the text but also the acoustic events.
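As an illustration, the output of such a streaming perception model might look like a single token stream with event tokens interleaved; the token names here are invented for the sketch, not the real model vocabulary:

```python
# Toy stream: ordinary text tokens interleaved with special acoustic-event tokens.
stream = [
    "<speak_start>", "I", "want", "to", "cancel",
    "<bgm>",                      # background music detected mid-utterance
    "my", "plan", "<speak_end>",
    "<laugh>",                    # non-lexical vocalization, kept as an event
]

for tok in stream:
    if tok.startswith("<") and tok.endswith(">"):
        print(f"acoustic event: {tok}")   # routed to the thinking layer as-is
    else:
        print(f"text token:     {tok}")
```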
Someone will ask: why not use an end-to-end model, speech in and speech out, audio tokens in and audio tokens out? Why do we need the text at all? Why not merge ASR, LLM, and TTS into a single unified model? We admit this may be the future; we think the future will be multimodal. But that is in theory. In practice we have experimented with many closed-source and open-source models, and the ASR + LLM + TTS pipeline performs better, especially in terms of intelligence.
So why is there an intelligence drop with multimodality? The problem may come from modality conflict. When we train the model, we do not have enough voice training data, so we first train a text-only model, then add the audio encoder and decoder, and use a small amount of training data to train a projection layer between them. This results in parameter divergence. For example, with an MoE (mixture-of-experts) model, some experts will specialize in handling speech. That is not what we want: if certain experts specialize in speech, speech takes a separate path from the main intelligence, a path that lacks real reasoning. So the reasoning ability on speech tasks decreases significantly.
To resolve this, we use the three-layer pipeline architecture of a perception layer, a thinking layer, and an execution layer, instead of a single end-to-end model.
Now let's go to layer two, the thinking layer. The goal is an interruptible, asynchronous model that accepts listening events, the acoustic events and text transcripts, while it is continuously thinking, and that can also speak while thinking. For example, it can give a short response within 1 second, think for another while, and then give a longer, more accurate response after that.
The input is a stream of input events, including acoustic events and transcripts; the outputs are GUI operations and speak events, including the text and the emotion to speak.
The core innovation is that we avoid the rigid observation-thinking-action loop of traditional ReAct. In traditional ReAct, the classic agent architecture, the loop is: observation (the transcripts), then thinking, and the thinking must complete before the action, where the action means saying something. And once it thinks, it has no choice not to act: if the human says "please wait," the model must still output something after thinking; it cannot choose to stay silent. This is contrary to the user's request.
Also, if we are thinking and another user utterance arrives, for example the user adds something, the existing thinking is dropped and we must think again from scratch to decide what to do next. So the loop is observe, think, act, then observe, think, act again. And the most critical problem is that the thinking latency is very high; this is the high response latency problem we discussed earlier.
In our interactive ReAct approach, we introduce flexibility into an interleaved loop of observation, thinking, and action. This gives us thinking while listening: new observations can be inserted into the ongoing thinking at any time. It also gives us speaking while thinking: we can respond immediately after a quick thought, then continue thinking and produce another response.
This approach maintains a unified, continuing stream of thought. Why does this work?
It rests on an observation about LLM processing speed. For prefill, we can achieve more than 500 input tokens per second, and state-of-the-art models, including Gemini 2.5 Flash and others, can achieve more than 100 output tokens per second. But voice input and output do not need that many tokens: normal human speech is about five tokens per second, so voice input and output together take only five to ten tokens per second. Looking at the whole timeline, more than 90% of the time is idle, redundant time available for thinking.
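A quick back-of-the-envelope check with these numbers (prefill is faster still, so it is not the bottleneck):

```python
# Speech occupies the channel at ~5 tok/s while the model decodes ~100 tok/s.
speech_tps, decode_tps = 5, 100
idle_fraction = 1 - speech_tps / decode_tps   # decoder time not needed for voice
print(f"compute left over for internal thinking: {idle_fraction:.0%}")  # 95%
```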
So the key is to exploit the gap time outside voice input and output, and spend all of it on internal thinking. This is the core point, because the traditional ReAct approach only uses the time between an observation event and the corresponding action; you must minimize the latency from observation to action, so you get very little thinking time, and while the other party, the user, is speaking, the LLM does nothing at all. What we want instead is for the large language model to work in parallel: while the user is speaking, and while the TTS is speaking, we think in the background.
This is what fast and slow thinking means in the thinking layer. For example, when we train the large language model to think this way, we want a fast response whenever an external event from the other party is observed. If the user says something, we expect a response within 500 milliseconds, so we reserve only about 50 thinking tokens for it. After that comes a more in-depth analysis: we give it, say, 500 tokens of slower thinking, and it outputs a follow-up.
For some very hard problems, even 500 thinking tokens are not enough, so we think twice, three times, or more; harder tasks get more thinking tokens. When multiple rounds of thinking are needed, the model outputs a summary of the current thinking after each round. This is analogous to human beings: a human says something while still thinking rather than staying completely silent.
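A minimal sketch of this fast/slow schedule, assuming a hypothetical budgeted-generation call `llm_think`:

```python
def llm_think(prompt: str, max_tokens: int) -> tuple[str, bool]:
    """Returns (thought_or_reply, done). Placeholder for a real budgeted LLM call."""
    return f"[{max_tokens}-token thought on: {prompt}]", max_tokens >= 500

def respond(user_event: str, max_rounds: int = 3) -> None:
    quick, _ = llm_think(user_event, max_tokens=50)   # ~500 ms budget
    print("quick reply:", quick)                      # filler / confirmation
    for _ in range(max_rounds):                       # slower, deeper passes
        deeper, done = llm_think(user_event, max_tokens=500)
        print("summary so far:", deeper)              # say something while thinking
        if done:                                      # a real model would decide this
            break

respond("do you want to confirm this order?")
```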
So this is how fast thinking, slow thinking, and continuous thinking work. Let's see a real-world example. The user says: "I want to change my plan from the current $19 one to a new one." The model's internal thinking begins: the user wants to change the plan, so I need the user's current plan details and the new plan's price.
But before that thought completes, the user may interrupt and add something, for example: "By the way, the new plan is the $79-per-month one." The model then continues the previous thought. This is very important: the internal thinking is not dropped and restarted. It is simply marked as interrupted, the new observation is inserted, and the interrupted thinking continues. After the thinking, the model outputs the assistant action, responding to the user based on the final thought. So interruptions inside the conversation are handled gracefully.
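Written out as data, the interleaved sequence the thinking layer sees for this example might look like the following; the roles and field names are illustrative:

```python
trace = [
    ("observe", "user: change my plan from the current $19 one"),
    ("think",   "need current plan details and new plan prices... <interrupted>"),
    ("observe", "user: by the way, the new plan is the $79/month one"),
    ("think",   "<resume> new plan is $79/month, compare with the $19 plan..."),
    ("act",     "assistant: got it -- the $79/month plan, let me check your account"),
]
for role, content in trace:
    print(f"{role:>7} | {content}")
```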
There is also the other half: speak while thinking. The part I just described is think while listening; speak while thinking means we first give a quick confirmation or quick response, then think, then give a longer response. In the traditional approach, for example, the user wants to confirm an order, there is a long silence of up to 10 seconds, and only then does the system ask to confirm or not. This is very unnatural.
In the interactive ReAct approach, when the user asks to confirm, the model spends at most 50 tokens on a quick thought and says a few filler words first. But these filler words are not the fixed, engineered kind, a canned "let me think" or "uh-huh." They are generated by the large language model and are contextualized: the model understands the current context and says something useful rather than merely buying time. In this case it says: "Let me just confirm, that is the $79-a-month plan, right?" That is the preliminary response. After the in-depth thought of at most 500 tokens, it gives the final response.
So this is speak while thinking. A more detailed sequence involves both aspects, thinking while listening and thinking while speaking; let's not go into those details here, we can talk about them offline. And here is an interruption-handling example: interruption and resumption in a real-time conversation.
We can also see that as a conversation grows longer, the internal thoughts become a heavy burden. As we said, voice input and output alone is only about five tokens per second, but with all the internal thoughts included it is around 100 tokens per second, roughly 20 times more tokens. The extra tokens are good for the consistency and fluency of thinking, but bad for cost, because such a long context is expensive. So we keep only the most recent turns of thinking, and for older turns we simply omit the internal thinking, using a windowing approach that keeps the context KV-cache friendly.
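A minimal sketch of such a windowing policy, assuming a simple turn structure:

```python
def window_context(turns: list[dict], keep_thinking_for_last: int = 3) -> list[dict]:
    """Keep recent turns verbatim; drop internal thinking from older turns.
    Pruning only at turn boundaries keeps prefixes stable for KV-cache reuse."""
    cutoff = len(turns) - keep_thinking_for_last
    pruned = []
    for i, turn in enumerate(turns):
        if i < cutoff:
            turn = {k: v for k, v in turn.items() if k != "thinking"}
        pruned.append(turn)
    return pruned

turns = [{"event": f"e{i}", "thinking": f"t{i}", "action": f"a{i}"} for i in range(6)]
for t in window_context(turns):
    print(t)
```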
This adaptive thinking model requires training. An off-the-shelf model is not good at understanding interruptions, or at continuing to think while listening when non-interrupting events arrive; it is not good at producing a quick thought followed by a longer thought; and it is not good at limiting its internal reasoning to 50 tokens to keep the initial response below 500 milliseconds.
To achieve that, we extract training data, for example using Claude 4 Sonnet to generate it, including alternating thinking and assistant texts. We emit the thinking as a content part; note that most major models, including OpenAI's and Gemini, expose only thought summaries rather than the original thinking tokens, while Claude 4 Sonnet does expose the original thinking tokens. So we generate test cases for the thinking and assistant texts: since we are using a SOTA model, we can use prompts and specialized test cases to elicit exactly the behavior shown in the previous examples. In this way we generate a large amount of supervised fine-tuning data, and then we use reinforcement learning to train a new model, for example based on a low-latency open-source model.
So, for adaptive thinking, thinking while listening, and thinking while speaking, the data engineering must solve length control, as we said on the last slide, and must cover several different scenarios, including simple greetings, complex decisions, and multi-turn dialogue.
One thing to highlight: the thought length must be part of the reinforcement learning reward function. The reason is that if we do not control the thought length, the model tends to use up the entire thinking budget, which is not good. For a simple greeting like "hello," do you really need 50 tokens of thinking? I don't think so. It should think very briefly, or not at all, because the context is really simple.
Likewise, after the quick response, if the model judges the question hard: do we really need 500 additional tokens, spending the whole budget and delivering the final response after 5 seconds? I don't think so; most questions are not that hard. So we use the thought length as part of the reward function and penalize long thoughts on simple questions. In this way we achieve truly adaptive thinking: simple questions are optimized for low actual response latency, while the long-thinking capability is reserved for hard questions and mission-critical questions. For example: do you want to accept this plan? Do you want to lower your bill? Do you want to accept this offer? There, filler words and filler sentences buy more thinking time to evaluate the plan and finally make a decision.
Next, layer three: the execution layer. It takes as input the outputs of the previous layer, the thinking layer with its main large language model. The main LLM's actions are simply things like speak or click: speak means saying something, and click means computer use, clicking in a virtual browser or virtual computer through a graphical user interface. The output is continuous signals, including the voice waveform and the mouse trajectory.
Let's first focus on the mouse problem. There are many computer-use agents online; almost every SOTA vendor has shipped one, but the common problem is that they are very slow. Why? Because each step takes a large screenshot as input, maybe 1,000 or 2,000 tokens; the model needs prefill time to process the screenshot and internal thinking to generate a click position, and then it clicks. Each step takes four or five seconds or more. It's very slow.
If we use a smaller model, say a 7-billion-parameter one, it can understand the screenshot well: given a 1,000-token screenshot, it knows where the search button is and that it needs to be clicked. But the problem with small models, especially ones not specifically trained for computer use, is that the model knows what to do but cannot do it. In its internal reasoning it states very clearly: "I need to click this search button." Yet when you look at the output coordinates of the click operation, the click(x, y) that is supposed to locate the element, the (x, y) often does not land on the search button; it misses the target.
Why the failure? Because there is a gap between what the model says and what it does: these vision-language models are trained largely on text-only data, and they lack a grounded understanding of screenshots, of the (x, y) coordinate system inside the screenshot.
This is not good. Many open-source frameworks tackle the problem with a labeling approach, using labeled bounding boxes as grounding hints for the vision-language model. How does it work? It takes a screenshot, and from the browser API we know which elements are clickable. We draw bounding boxes around the clickable elements and mark them 1, 2, 3, 4; in the prompt we provide the labels and bounding boxes, and the model outputs a label in order to click. So the cognitive difficulty of producing exact coordinates is reduced to outputting the label of a bounding box. It is a very clever technique, but it is very hard to generalize to general applications.
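Before looking at where it breaks down, here is a minimal sketch of the labeling idea (often called Set-of-Marks prompting); the element list is hypothetical:

```python
clickables = [
    {"label": 1, "box": (40, 10, 120, 40),  "text": "Inbox"},
    {"label": 2, "box": (40, 50, 120, 80),  "text": "Sent"},
    {"label": 3, "box": (200, 10, 280, 40), "text": "Search"},
]

prompt = "Click the search button. Elements:\n" + "\n".join(
    f'[{e["label"]}] "{e["text"]}" at {e["box"]}' for e in clickables
)
print(prompt)
model_answer = 3                              # model outputs a label, not (x, y)

box = next(e["box"] for e in clickables if e["label"] == model_answer)
x = (box[0] + box[2]) // 2                    # click the center of the box
y = (box[1] + box[3]) // 2
print(f"click at ({x}, {y})")
```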
For example, there are many complex interfaces. Take Gmail: one screenshot can show 50 emails plus many labels, tabs, and action buttons, easily more than 200 actionable elements, with the bounding boxes overlapping each other, which makes the screen very hard to read. And there are even harder cases: editors of every kind, text editors, rich-text editors, document editors, spreadsheets such as Excel, PowerPoint, and video editing software. There you may not even have bounding boxes at all; you must know the exact coordinates to operate such software.
To resolve this, we need to train an end-to-end VLA model. VLA (vision-language-action) comes from robotics terminology, and the GUI-operation or computer-use task is in fact very similar to a robotics VLA task interacting with the physical world. There is vision: you look at a screenshot, just as a robot takes a photo of its environment. There is language: the objective you want to achieve. And there are actions: in robotics these are joint angles, how far to move each arm or leg; in computer use they are single click at given coordinates, double click, right click, keyboard combinations, and so forth.
We have two ways to train such a model. The first option is to train the main model to directly output the mouse-click coordinates. That is, the main LLM is trained through reinforcement learning: we generate a massive number of test pages, leveraging the LLM's coding ability to generate them, and since a mouse click is very easy to verify in a simulated browser environment, it is a great setting for RL rollouts. The main LLM learns to ground elements to exact coordinates.
But for this to work, the screen resolution must be fixed. That is exactly why, when we use OpenAI's models, Gemini 2.5 Pro, or Claude 4 Sonnet, models post-trained with reinforcement learning for computer use, we must fix the resolution to one of their training values. There is another pitfall: there is no mouse movement trajectory. This is just like web crawlers: a crawler uses Playwright to locate an element and click it; the mouse does not travel there, it simply clicks that location regardless of the current mouse position. Some CAPTCHAs use exactly this behavior to detect bots, so it is very hard for such agents to get past them.
Also, the main model, whether a voice agent, a text agent, or a computer-use agent, will be a very large model, so doing reinforcement learning on it is very costly.
So we come to the second option: train a separate VLA model that works in conjunction with the main large language model and imitates human mouse-movement patterns. By imitating we mean it moves, fine-adjusts, and clicks, the way a real human moves a mouse. A real human cannot simply name the exact (x, y) coordinates of the target; so we generate trajectories from the current mouse location to the final mouse location that resemble real human trajectories. If we use real trajectories as training data, the agent will not be identified as a robot, and in particular it passes most CAPTCHAs.
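Our VLA model learns trajectories from data, but to illustrate what "human-like" means, here is a toy generator using a minimum-jerk velocity profile with a slight arc:

```python
import math

def human_like_path(start, end, steps=30, bow=0.1):
    """Toy human-like mouse path: minimum-jerk easing plus a perpendicular bow."""
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    points = []
    for i in range(steps + 1):
        t = i / steps
        s = 10 * t**3 - 15 * t**4 + 6 * t**5   # minimum-jerk: s(0)=0, s(1)=1
        arc = bow * math.sin(math.pi * s)       # bowed, not a straight line
        points.append((x0 + dx * s - dy * arc, y0 + dy * s + dx * arc))
    return points

for x, y in human_like_path((100, 100), (640, 360))[::10]:
    print(f"move to ({x:.0f}, {y:.0f})")
```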
But for this to work, we need a low-latency VLA model that can quickly adjust the mouse pointer based on the current screenshot, especially on the relative position of the pointer and the target element, choosing the next action step by step. The model must be small, and a small model suffices here because it only does the grounding task, not the reasoning task: the objective it receives is simply the output of the main LLM.
For example: click the product in the third row and second column, click the search button, or type into the search box. This is very simple. Grounding is simple; even traditional ResNet-style models produce quite good grounding results. So a small fine-tuned VLA model works very well here.
The next step is a better text-to-speech model. The previous slides covered the VLA model that operates the computer; the TTS model targets real-time voice interaction over phone calls. Many current TTS models are not natural when generating voices. The main problem is that they are too perfect: no pauses, no filler words, no repetitions. Real humans have all of these.
The main source of those pauses and filler words is that humans think slower than machines. AI thinking is very fast, around 100 tokens per second; no human thinks at 100 tokens per second, while real speech comes out at just five tokens per second. So an over-polished voice sounds like a podcast, not a real human. We therefore need the main LLM to generate cognitive pauses before the text-to-speech model renders the speech.
For example, the main LLM can generate sentence transitions and searching-for-words markers, and the TTS model generates audio from this text plus control actions: thinking, uncertain, searching, and so on. Beyond the thinking and searching markers, the filler words, the control text can also set emotion and speech rate: speak slowly or faster, happy or angry, different speaking styles. The model can even emit special sounds via special tags or tokens output by the main LLM. With these, the TTS renders much more natural speech, similar to a real human, which also makes it easier to pass robot checks, because most people do not want to speak with robots. If your voice agent sounds like a robot, the human is likely to hang up your call. This is why we want the speech to sound natural, not like a polished professional podcast.
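As an illustration, the control markup handed from the main LLM to the TTS might look like this; the tag names are invented for the sketch, and the real tag set is whatever the TTS model was trained on:

```python
utterance = (
    "<style rate='slow' emotion='friendly'>"
    "Let me just confirm <pause ms='300'/> that's the, <filler>uh</filler>, "
    "seventy-nine dollar a month plan, right?"
    "</style>"
)
print(utterance)  # rendered with a pause, a filler word, and a warm tone
```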
Here is the summary of the three-layer architecture for real-time interaction between agents and real humans. We have the perception layer, the thinking layer, and the execution layer. The perception layer takes voice and GUI changes as input and outputs a discrete event stream, including transcripts, UI change events, and acoustic events. The thinking layer takes the perception layer's output and emits interleaved thinking and action commands.
The most important point about the thinking layer is that we adopt the interactive ReAct paradigm to resolve the sequential bottleneck of the traditional observation-thinking-action loop. We allow observations and actions to be interleaved with thinking, because thinking takes most of the time and happens continuously. So we can think while we are listening, and think while we are speaking.
The whole process is fully interruptible: new events are processed immediately without destroying the ongoing thinking. It is also fully asynchronous: the model can produce multiple rounds of thinking and multiple output actions, unlike current thinking models, which think once, act once, and stop. We can think, act, think again, act again, and so on, until the model decides there is nothing more to do.
The execution layer takes the thinking layer's action commands: GUI operations, text to speak, and control commands for the text-to-speech model. It outputs continuous signals, including mouse movements; as noted, we train a VLA model so the mouse moves like a real human's, a continuous mouse trajectory rather than a bare click event. For TTS, we output audio tokens, which an audio decoder turns into a waveform via the mel spectrogram.
Looking ahead, we would like to see an end-to-end model. Today the acoustic events, text, and actions serve as the intermediate representation between layers, but there is clear duplication across the three large language models: the perception, thinking, and execution layers each contain one, with the thinking layer holding the largest. There is duplicated capability, and there is also some loss of information, because all the multimodal information is squeezed into text. So in the future we would like to consolidate the three layers' language models into a single unified one, with the perception and execution layers simplified into an audio encoder and decoder. This remains future work; we encourage the audience to explore it. And here is a future outlook for our real-time agents: three levels of AI agent interaction with the world.
We have covered real-time voice calls and graphical-user-interface operations, i.e., computer use. Voice has a data density of only around ten to 15 tokens per second; GUI operation is much harder because the interface keeps changing. For real-time gaming through GUI operations, for example, today's models cannot keep up: the graphical interface changes too fast, and neither our VLA model nor the main large language model can match that speed. But in the future, with a main model that better understands rapidly changing GUI sequences, we could in theory let agents play real-time video games. The next step, even harder, is the physical world.
There, the modalities are vision, voice, and also touch. And the output is not only mouse clicks, key presses, and voice; there are also joint action sequences, how many degrees each joint should move, or perhaps three-dimensional coordinates in space. This is much harder, and the latency requirement is strict: we need less than 100 milliseconds of latency, because otherwise, when you touch something, you have no feedback loop.
Without it, you must rely on extremely reliable, highly accurate hardware, which is complicated and expensive. But humans do not work like this. Humans do not depend on expensive, ultra-precise control, just as we do not click buttons by emitting exact (x, y) coordinates, and we do not touch objects by emitting (x, y, z) coordinates in space; we move using a very fast feedback loop. Enabling this kind of fast feedback loop for control is very important.
So this is the end of part one, real-time agents. Next, part two: agents learning from experience.
Why do agents need to learn from experience? There are many SOTA models: take one, give it some context and some tools, and let it go. Why learn from experience at all, if we already have SOTA models at hand? The reason is that a SOTA model is like a top graduate: very knowledgeable, but lacking experience.
Take our business: Pine helps busy users negotiate with customer service. We need to verify information. Before making the phone call to customer service, we need to anticipate what they will require to verify identity: the last four digits of a credit card, the full address, maybe a confirmation number, and so on.
If we do not know this the first time, we call customer service and fail. But when another user runs the same task, can we anticipate what customer service will ask and collect it from the user before actually calling? That gives a much better user experience.
Second, task procedures. Say we want to cancel a service, and customer service tells us it cannot be canceled over the phone; you must fill out a form on the website. The next time we run the same task, we do not need to get on the phone at all: we simply use the computer-use agent to fill out the form on a virtual computer. Much simpler and much more efficient.
Third, business rules. If we can anticipate which kinds of people receive certain discounts, for example veterans, or customers loyal for over two years, and estimate the price, for example a 3-gigabit-per-second broadband plan whose pricing differs by county, then we can plan a much better negotiation strategy, manage the user's expectations up front, and know which preference questions to ask the user before contacting customer service.
These business conditions are dynamic, and the underlying decisions, rules, and guidelines are not publicly disclosed. So simply improving the general capabilities of base models cannot solve this experience-based problem. This is why we call SOTA models top graduates: very knowledgeable, but lacking experience in specialized tasks.
Learning from experience is in fact a fundamental problem in machine learning, and there are three paradigms. The first paradigm is the old one: learning through gradients, including pre-training and post-training. The method is parameter update.
The second paradigm arrived with large language models: we finally found that with LLMs we often do not need to update the parameters or fine-tune for a specialized task. Instead, we can learn implicitly through the attention mechanism. This sounds academic, but it is very simple: put the prompts and a few examples into the context, and the LLM will follow your prompts, your rules, and your examples when predicting what comes next. It simply uses the long context as a temporary memory. This is the second paradigm, and almost all agents today operate in it.
And now we have the third paradigm: externalized learning. We externalize knowledge and processes into tools and into a knowledge base. The knowledge base is a textual representation of the knowledge; the tools use code to achieve self-evolution of the agent. We will walk through all three paradigms in the following slides. The core insight is this: previously we had only the first paradigm, optimizing model parameters through gradient descent. Now we have more options, in-context learning, knowledge bases, and tool generation, which require no updates to the original model parameters and are much more efficient for learning specialized knowledge in a specialized area.
Let's go into the first method, post-training: using reinforcement learning to make agents proficient at a certain kind of task. Some people think we can make large language models much smarter through RL, but this is very hard. Researchers have pointed out that reinforcement learning struggles to improve the raw intelligence of a model. The reason: reinforcement learning needs something to reinforce. It needs at least one successful output, one successful trajectory, in the rollout process; only then can you reward that trajectory. RL converts pass@k into pass@1: if I roll out 32 times and find one success, then through RL training we can reach 90% task success. Going from 3% to 90% is possible. But if the success rate is 0%, the intelligence simply is not there; the model has no chance of producing the correct behavior, so you have nothing to reinforce.
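The dependence on at least one success can be made precise. If a single rollout succeeds with probability $p$, then over $k$ independent rollouts:

$$P(\text{at least one success in } k \text{ rollouts}) = 1 - (1 - p)^k$$

With $p = 0.03$ and $k = 32$, this is about $0.62$, so most batches contain a trajectory to reinforce; with $p = 0$ it is zero for every $k$, and RL has nothing to work with.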
So the true value of RL is to internalize hundreds of complex rules, the know-how of a specialized area, for better instruction following. For example, an Anthropic write-up describes more than 50 odd rules, about coding, about not outputting irrelevant content, baked into a Claude 3 Haiku model. Because the Haiku model is very small, it cannot follow all these instructions with in-context learning alone. But if you use reinforcement learning with feedback on the rules, or simply convert the instructions into a training set and run continued pre-training plus supervised fine-tuning, the resulting model strictly follows all 50-plus arbitrary rules.
Reinforcement learning can also train tool calling: it converts pass@k into pass@1 and, more importantly, it reduces latency. For OS agents and GUI-operation agents we want low-latency operations, but if we rely on in-context learning, the model uses internal thinking, test-time scaling, to scrutinize every rule in the guidelines. It works like a checklist, spending thinking tokens to go over the rules one by one, so latency cannot be low. If you really want low latency, you must internalize these instructions and rules into the model itself.
This shortens the chain of thought and reduces latency. Tool use is likewise guided by rules: some live in the tool descriptions, explaining what each tool is for; some live in the system prompt or task prompt. For our agent operations, we use phone calls to reach the external world, the computer-use agent, web search, knowledge-base search, and many other tools. The whole process can be written as a very long procedure document, say 500 lines, which is easy for the RL process to internalize. Then we can simply remove that 500-line procedure document from the context, and both latency and cost drop substantially.
Some people assume reinforcement learning is very expensive, but it is not. We can refer to a paper on tool-integrated RL that trains a model to make tool calls, writing code to solve math problems. It uses automatic trajectory generation and simple rewards, and the training efficiency is very high: roughly $5,000 of GPU time, just several hundred GPU hours, to train the specialized RL model, whose math performance is on par with some of the strongest SOTA models. If you just want a model for your own specialized area and do not need broad generalization, this is great.
But if you want generalization across many different domains, you will need tremendous computational power and much more data. Here is some recent agent research, for example Kimi K2. It is "model as agent": it performs large-scale agent data synthesis and joint reinforcement learning training.
The most important idea here is model as agent, and it carries an equally important corollary: model as product. If you really want to do RL training, you must build a simulation environment, and that simulation environment is already nearly a product: it needs web search, computer use, and all the other tools wired together. The only missing piece is a UI; add a user interface and you have a complete product. So once model training is complete, the agent can be released directly as a product. This matters for agent developers: we cannot compete with the foundation model companies head-on. They build general agents, and the models they train convert directly into products, so a pure agent company has no advantage at the general level.
For an agent company to have an advantage, it must specialize in an area. For voice agents, we discussed many specialized models: the ASR and streaming perception models, the TTS models, the low-latency thinking-while-speaking and thinking-while-listening models, and the GUI-operation model, which resembles a VLA model doing the mouse operations. These specialized models are why we exist as an AI agent company. We also have a specialized knowledge base: domain knowledge, agentic processes, SOPs, and know-how for our specialized area.
This is why we watch the model-as-agent trend closely: we know this wave is coming.
Our agent is also model-native, as described in the earlier slides on the three-layer architecture of real-time agents. The models themselves are trained in an agentic process through a simulation environment, not just by talking to real humans: we use a large language model as a judge to evaluate the rollouts and drive the reinforcement learning.
For the critics, there is a systematic approach from Kimi K2, going from rules to self-critique; we will skip the technical details here.
and let's go into the method two uh in context learning in method one we talked about post training we talked about method to just to to do the fine tuning
of the parameters to train knowledge into the parameters but in g context learning is enabled by the large language models because it can
If we know the transformer architecture, we know the attention mechanism. Attention is similar to an informational query, similar to a key-value store, but it is not an exact key-value store. In a traditional key-value store in the systems area, the key is a string and the value is another string; we compare the query with the keys for an exact match and fetch the value. The query-key-value terminology in attention is adopted from systems, but it works much differently: it is based on soft similarity (a dot product) instead of exact match.
The keys represent the information inside all the previous tokens, including the input tokens and the tokens already output, and the query represents the information need of the current token. Attention compares the query against all these keys, weights the most similar ones highest, and then adds the corresponding values together. The values are the actual content of all the historical input and output tokens. We add them together, weighted by the resulting attention scores, and to do this we scan through all the keys.
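As a concrete picture, here is a minimal numpy sketch of this soft key-value lookup, i.e. scaled dot-product attention for one query; the shapes and names are illustrative, not any particular model's.

```python
import numpy as np

def soft_kv_lookup(q, K, V):
    """One attention step: q is the current token's query (d,);
    K and V hold one key/value row per historical token (n, d)."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)           # similarity of the query to every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: weights over the whole history
    return weights @ V                    # weighted sum of the stored values

# Toy history of 4 tokens with dimension 8.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
q = rng.normal(size=8)
print(soft_kv_lookup(q, K, V).shape)      # (8,) -- one mixed "value" vector
```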
We can call this a brute-force approach. When we design any key-value database or relational database in the traditional systems area, we need a lot of indexes to filter out most of the keys, so we only have to look at a few keys and run faster. But transformers are different: we just leverage almost infinite computational power, although it costs a lot. It is a brute-force approach, and it matches what we see in the bitter lesson: with higher computational power comes higher intelligence. And it is also different from a vector database.
Someone may think that the transformer is similar to a vector database, but it is actually different. In a vector database the key and the value are the same thing, just the content to be searched, and the query vector has the same distribution as the keys, because both are embeddings of text.
In a transformer, however, the Q, K, and V distributions are all different, so we cannot simply reuse the sparsity assumptions of a vector database. This is a very important observation, and it is what makes sparse attention not so easy to get working.
Before going into those details, let us talk about the two bottlenecks of long context: the memory wall and the compute wall. We are very familiar with the memory wall: KV cache usage is excessive, because its size is batch size times sequence length (the number of input and output tokens) times the model's hidden dimension, accumulated across layers. There are many optimizations for the memory wall, and there are also ways to deal with the compute wall, because attention has O(n²) computational complexity. A quick back-of-the-envelope calculation below shows the scale of the memory wall.
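Here is that small sanity check. The model shape below (32 layers, 8 KV heads of dimension 128, fp16) is a made-up example, not any specific model.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per=2):
    # 2x for keys and values, stored per layer per token; fp16 = 2 bytes.
    return 2 * layers * batch * seq_len * kv_heads * head_dim * bytes_per

# Hypothetical 32-layer model serving one 128k-token request.
gib = kv_cache_bytes(batch=1, seq_len=128_000, layers=32,
                     kv_heads=8, head_dim=128) / 2**30
print(f"{gib:.1f} GiB of KV cache for a single sequence")   # ~15.6 GiB
```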
For the memory wall, there is MLA, multi-head latent attention, a fundamental method to reduce the KV cache. (I started describing a three-part scheme here: a summarization that compresses 16 tokens into one embedding vector, a sparse attention that, like a vector database, ignores the dissimilar keys, and a recent conversation window. Sorry, that is NSA, native sparse attention, which comes later; this slide is about MLA.)
MLA performs a low-rank projection of the key and value vectors into a low-dimensional latent space, which compresses them, and it uses a shared representation across the multiple attention heads, so all heads share the same low-dimensional key-value representation.
But this is not as easy as it looks. If we look at the formula of softmax attention, it seems the projection matrices can simply be merged (absorbed into the query projection), but there is one small thing: RoPE, the rotary positional encoding, which we need in order to encode positional information. With RoPE in the formula, the matrices are no longer mergeable, so the naive version does not work.
To keep the positional encoding, the MLA work does a very simple trick: it separates the positional information from the main content. The main content is passed through the low-rank space, through the compressed KV, while the positional information is processed independently through another path, and the two are combined at a later step. This way the KV cache is reduced by roughly 16 times, and inference becomes much faster.
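Here is a minimal sketch of the idea: one shared low-rank latent for the K/V content plus a small decoupled RoPE path. The dimensions, the toy rotary function, and the single-head setup are illustrative assumptions, not DeepSeek's exact implementation.

```python
import numpy as np

d, d_latent, d_rope, n = 512, 64, 32, 100        # toy sizes
rng = np.random.default_rng(0)
W_down  = rng.normal(size=(d, d_latent)) * 0.05  # compress hidden -> latent
W_uk    = rng.normal(size=(d_latent, d)) * 0.05  # latent -> key content
W_uv    = rng.normal(size=(d_latent, d)) * 0.05  # latent -> value
W_krope = rng.normal(size=(d, d_rope)) * 0.05    # separate positional path

def rope(x, pos):
    # Toy rotary encoding: rotate dimension pairs by a position-dependent angle.
    half = x.shape[-1] // 2
    theta = pos * 0.01
    a, b = x[..., :half], x[..., half:]
    return np.concatenate([a*np.cos(theta) - b*np.sin(theta),
                           a*np.sin(theta) + b*np.cos(theta)], axis=-1)

H = rng.normal(size=(n, d))                      # hidden states of n tokens
# Only these two small things are cached per token:
latent = H @ W_down                                               # (n, 64)
k_rope = np.stack([rope(h @ W_krope, i) for i, h in enumerate(H)])  # (n, 32)

# At attention time, content keys/values are re-expanded from the latent,
# and the score concatenates the content part with the positional part.
K = np.concatenate([latent @ W_uk, k_rope], axis=-1)
V = latent @ W_uv
print(latent.shape[-1] + k_rope.shape[-1], "cached dims vs", 2 * d, "for full K,V")
```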
Now for the compute wall. The idea there is to treat the KV cache like a vector database. But as we already said a few slides ago, this is very hard, because Q, K, and V do not have the same distribution. If you want to simply find the top several keys that probably have the highest attention with a query, it still requires scanning a very large proportion of the keys.
Some works do propose heuristics, for example keeping attention sinks on the first few tokens plus the recent tokens as the filter. We know the attention-sink behavior on the first few tokens emerges during training: softmax has to sum to one, and when the model does not know which token to attend to, it puts attention on the first several tokens by default, which serve as the attention sink. And for the recent tokens, we must pay attention to them to do next-token prediction; we need the context of the immediately preceding tokens. These are very natural observations. But if we keep only these tokens based on such special-case observations, some of the highest-scoring keys in the middle will be missed.
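A minimal sketch of this heuristic, the sink-plus-recent-window pattern popularized by StreamingLLM; the sizes are arbitrary, and real implementations apply the mask inside fused kernels rather than materializing it.

```python
import numpy as np

def sink_window_mask(n, n_sink=2, window=4):
    """Boolean (n, n) causal mask keeping only the first n_sink tokens
    and the last `window` tokens before each query position."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    causal = k <= q
    keep = (k < n_sink) | (q - k < window)
    return causal & keep

m = sink_window_mask(32)
print(m.sum(), "of", (32 * 33) // 2, "causal entries kept")   # 177 of 528
# Any strong key in the middle of the context is simply dropped -- the caveat.
```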
This is the caveat of this kind of sparse attention. There is another way, which I think is the more fundamental one: linear attention, which changes the computational order to achieve linear complexity.
This goes back to the origins: the transformer was proposed to overcome the limitations of RNNs, recurrent neural networks. It gained parallelizable computation, but at the cost of more computation to be done. If you rewrite standard attention, replacing the softmax with a kernel function, you can split the Q and K multiplication and let the keys and values be multiplied together first. Because K and V are multiplied together first, you avoid constructing the huge N × N matrix, where N is the number of tokens in the historical corpus, i.e. the context length, which can be very large. The advantage is extremely low computational cost; moreover, the state size, the memory, and the computation per token do not increase with the number of tokens. This is very good behavior: it means we can have an almost unlimited number of history tokens in the context without any increase in per-token cost.
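Here is a minimal numpy sketch of that reordering, using the common elu(x)+1 feature map as the kernel (an assumption; different linear-attention papers pick different maps). The point is that the KᵀV state has a fixed size, so the cost per token is constant.

```python
import numpy as np

def phi(x):                        # a positive feature map standing in for softmax
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1

rng = np.random.default_rng(0)
n, d = 1000, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Quadratic order: phi(Q) @ phi(K).T is an n x n matrix -> O(n^2 d).
A = phi(Q) @ phi(K).T
out_quadratic = (A / A.sum(axis=1, keepdims=True)) @ V

# Linear order: accumulate S = phi(K).T @ V (d x d) and z = sum phi(K) (d,).
S = phi(K).T @ V                   # fixed-size state, independent of n
z = phi(K).sum(axis=0)
out_linear = (phi(Q) @ S) / (phi(Q) @ z)[:, None]

print(np.allclose(out_quadratic, out_linear))    # True: same result, O(n d^2)
# (The causal version keeps running prefix sums of S and z instead.)
```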
But this is obviously lossy compression. Information theory tells us a fixed-size state cannot hold unlimited information, so it must lose some. And the needle-in-a-haystack retrieval capability is the foundation of in-context learning: if a model cannot do needle-in-a-haystack retrieval, it is not good at instruction following and not good at long-chain reasoning, because both are built on in-context learning. Instruction following and long-chain reasoning are in turn the foundation of the tool-use capability. So if you use only linear attention, you are unlikely to build a really good agent, because agents need frontier tool use, which depends on instruction following and long-chain reasoning.
There is also another problem with linear attention: softmax attention often performs better than linear attention on short contexts. For example, with a context length of 2,000, a very short question, and the same key-value cache size, the performance of linear attention on these short questions is not as good as softmax.
To overcome these limitations there are several approaches; we will talk about two. The first is Google's Infini-attention and the second is MiniMax's lightning attention.
Google's Infini-attention introduces compressive memory into the traditional attention mechanism. It has two kinds of attention. The first is local attention, which is just masked softmax attention over the current segment; the long-term attention is similar to linear attention. By combining the two, we keep memory efficiency, with a fixed number of memory slots being updated, yet we can support effectively infinite context.
Actually, in the implementation they are not simply using linear attention; they use a compressive storage for the long-term memory. So the usable context length is still limited in practice, but it is much longer and much more memory-efficient than plain softmax attention.
The next example is NSA, native sparse attention, from DeepSeek. (This is the scheme I mistakenly described on the MLA slide earlier.) It has three different attention mechanisms. The first is coarse-grained token compression, a summarization, similar to summarizing a long context to reduce token usage. The second mechanism is fine-grained token selection, which is similar to a vector database: it selects the most relevant snippets of the context. Note that this selects the precise original context, not a summary; that is what makes it different from the compression branch. And the third is the recent window of the conversational context, which is kept as unchanged softmax attention. So full attention over the recent window, plus sparse attention for token selection, plus the coarse-grained summarization compressing the KV cache: these three mechanisms working together can achieve significant acceleration of decoding.
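A minimal sketch of the three-branch structure, assuming mean-pooling for the compression branch and a plain top-k dot-product for the selection branch; the actual paper learns these components and gates the branches per head, so this is only the skeleton.

```python
import numpy as np

def softmax_attn(q, K, V):
    w = np.exp(K @ q / np.sqrt(q.size)); w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
n, d, block, topk, window = 512, 64, 16, 4, 32
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
q = rng.normal(size=d)

# Branch 1: coarse compression -- one pooled key/value per 16-token block.
Kc = K.reshape(-1, block, d).mean(axis=1)
Vc = V.reshape(-1, block, d).mean(axis=1)
# Branch 2: fine selection -- expand only the top-k scoring blocks, raw tokens.
best = np.argsort(Kc @ q)[-topk:]
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in best])
# Branch 3: sliding window -- full attention over the most recent tokens.
out = (softmax_attn(q, Kc, Vc)
       + softmax_attn(q, K[idx], V[idx])
       + softmax_attn(q, K[-window:], V[-window:])) / 3   # NSA uses learned gates
print(out.shape)
```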
And now we come to the lightning attention of MiniMax. MiniMax takes a drastically different route: it proposes a hybrid that combines linear attention with softmax attention. In theory we could use only linear attention, but as we discussed, linear attention has drawbacks: it is not very good on short contexts, and on long contexts the in-context learning and needle-in-a-haystack capabilities are not very good. So MiniMax inserts one softmax attention block after every seven lightning (linear) attention blocks. With 80 layers, every seven lightning attention layers are followed by one softmax attention layer, so 10 layers are softmax attention. This strikes a balance between long- and short-range dependencies.
In this way, MiniMax supports more than one million tokens during training and more than four million tokens during inference, and it is also an open-source model.
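The interleaving itself is trivial to express; a sketch of the 7:1 pattern, with the block names as stand-ins:

```python
# Hybrid stack: one softmax (full) attention layer after every 7 linear ones.
N_LAYERS, RATIO = 80, 8   # every 8th layer is softmax -> 10 softmax layers

def make_layer(i: int) -> str:
    return "softmax_attention" if (i + 1) % RATIO == 0 else "lightning_attention"

stack = [make_layer(i) for i in range(N_LAYERS)]
print(stack[:8])                            # 7 x lightning, then 1 x softmax
print(stack.count("softmax_attention"))     # 10
```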
Now let me give some insights about learning. We have talked about parameter updates through reinforcement learning and post-training, and we have talked about in-context learning. Many people wonder: what is the difference between in-context learning and parameter fine-tuning? Several recent theoretical advances have pointed this out; in particular, the paper "Learning without Training" argues that transformer blocks implicitly update the MLP weights through the context. That is, in in-context learning the attention mechanism is effectively performing rank-one updates to the MLP part of the original parameters.
Each time we output one token, the attention mechanism produces the equivalent of a rank-one update per attention head. So if we have 32 attention heads, we get a low-rank, rank-32 update to the MLP.
This is why language models can learn new patterns from the prompt, from few-shot examples, and from agentic tool-call trajectories, without any training.
This implicit weight-update mechanism provides a theoretical basis for designing better prompts.
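To make the flavor of the claim concrete, here is a tiny numpy identity: if attention shifts the MLP's input by a context-dependent vector Δh, the output equals leaving the input alone and patching the weight matrix with a rank-one update. This is a simplified illustration of the paper's result, not its full derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))      # an MLP weight matrix
h = rng.normal(size=8)            # the token's hidden state
dh = rng.normal(size=8)           # what attention adds from the context

# Output with the context contribution added to the input...
with_context = W @ (h + dh)
# ...equals a rank-1-patched W applied to the bare input.
W_patched = W + np.outer(W @ dh, h) / (h @ h)
print(np.allclose(with_context, W_patched @ h))   # True
```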
Here are some counterintuitive findings from another paper, "Deeper Insights Without Updates": for tasks containing implicit patterns, in-context learning can capture and utilize those patterns better than fine-tuning. This is quite counterintuitive, because most people think fine-tuning lets the model internalize the knowledge, while in-context learning, relying on attention, is shallow and sparse. But as we said on the previous slide, the attention mechanism is not merely a sparse lookup; it is equivalent to a low-rank update to the original MLP parameters.
One piece of evidence uses circuit-shift theory, proposed in that paper, to study why in-context learning is better at pattern recognition. A circuit is a subgraph in the model responsible for a specific behavior, composed of specific attention heads and MLP layers, representing the model's thinking path for solving a problem. In-context learning leads to larger-scale circuit shifts: after the prompt, the few-shot examples, or the agent's working trajectory, the circuits responsible for the specific behavior change more than they do when similar data is used to fine-tune the model.
Another example is pattern capturing: by manipulating the activation values of specific components, we can ask whether the model quickly captures the patterns inside the examples. Pattern capturing needs attention over the few-shot examples in the prompt. And we find that in-context learning is not simple pattern matching; it actually activates several different computational circuits in the model.
Even with a thousand times more training examples, the improvement from fine-tuning is very limited. This explains why in-context learning can achieve better results even without any parameter updates.
So the sample efficiency of in-context learning is large, and this has an implication for our agents: before trying reinforcement learning or supervised fine-tuning on an agent, we should first ask whether we can simply use in-context learning, prompts, and few-shot examples. If that is possible, we can often achieve better performance, with better accuracy and higher efficiency.
Okay. That is the end of the in-context learning part; next we go to our third method, externalized learning.
We treat externalized learning as a third paradigm, different from the first (parameter updates via gradient descent) and the second (in-context learning), because it is external to the model itself. It may seem like just a hack, but it is not just a hack; let me explain.
This is the most important part of the talk. The first mechanism is the knowledge base, and the second mechanism is tool generation.
Let me first talk about the external knowledge base: rapidly accumulating experience from past trajectories and applying that experience to new trajectories, to new executions.
As with any knowledge base, there are two parts: knowledge representation and knowledge retrieval. For knowledge representation there are several ways. The first is summarization: we store a summary, for example "to cancel service B from company A, you need to verify the user's identity by providing the order number, registered email, and last four digits of the credit card number." That is conclusive, summarized information. Another way is the raw log, for example the complete transcript of contacting customer service; for raw details you need descriptive file names and storage in the file system. Summary and raw details are the two ends of a spectrum, and there are middle points: tree-based summarization. RAPTOR is great work here: it treats all the raw details and raw trajectories as the leaf nodes of a tree, performs clustering to generate summaries, then clusters the summaries to generate the next level up, and so on to the root.
For retrieval over such a tree, there are two ways. The first is a vector database doing similarity matching over the leaf nodes and intermediate nodes; the second is to index it like a file system, navigating via the descriptive file names in an agentic way. A sketch of the tree-building step follows.
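Here is a minimal sketch of RAPTOR-style tree building, with stand-in `embed` and `summarize` functions and hard k-means clustering for simplicity (RAPTOR itself uses soft clustering with Gaussian mixtures and real embedding/LLM calls).

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):                 # stand-in embedding model
    rng = np.random.default_rng(len(texts))
    return rng.normal(size=(len(texts), 64))

def summarize(texts):             # stand-in for an LLM summarization call
    return "summary of: " + " | ".join(t[:20] for t in texts)

def build_tree(leaves, branching=4):
    """Cluster nodes and summarize each cluster, level by level, up to a root."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        nodes = levels[-1]
        k = max(1, len(nodes) // branching)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embed(nodes))
        parents = [summarize([n for n, l in zip(nodes, labels) if l == c])
                   for c in range(k)]
        levels.append(parents)
    return levels                  # retrieval can match against every level

tree = build_tree([f"raw trajectory {i}" for i in range(16)])
print([len(level) for level in tree])    # e.g. [16, 4, 1]
```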
A final way is agentic autonomous summarization. The reason for it is that the summaries, raw details, and tree-based summaries above are not agentic: they happen at fixed points of the agent's work. In this last way, we let the agent autonomously decide when to write summaries into the file system and what to write, and into which files. A lot of agents already use this; for example, ChatGPT and Cursor use autonomous summarization, letting the agent decide on its own.
For the retrieval part, we have three ways. The first is automatic matching and context insertion.
This is suitable for fixed kinds of information, keyed, for example, on the user ID: the profile of each user, the contact information of each user; or, for each business, its contact number and working hours. This fixed information fits a rigid schema, so we can do the automatic matching and insert it into the context; there is no need for the LLM to decide anything autonomously.
The second way is agentic semantic search, which means the agent can autonomously decide which keywords to use for searching. For example, it can use vector databases or BM25 to match against the knowledge-base keywords.
The third way is agentic file-system reading: let the agent use tools such as read-file, find, grep, etc. to read content from the file system.
All three ways are important, and I want to highlight that the agentic ways matter: semantic search and file-system reading are the two that the agent decides to use autonomously. For example, if it cannot find any matches through semantic search, it should fall back to file-system search and read the files itself. It is very important to give redundant, overlapping tools so the agent can leverage all kinds of search capabilities, similar to how a human uses a knowledge base, a database, or a computer file system.
Now a small tip, based on Anthropic's contextual retrieval. It is an important tip for improving the retrieval accuracy of RAG. The core problem is that traditional RAG loses context when chunking documents. For example, a chunk says "the company's revenue grew by 3% over the previous quarter"; without the surrounding context of the document, retrieval accuracy for this chunk will not be high. But we can use summarization to add a short explanatory context to the chunk, so we get much better matching when we do the retrieval.
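A minimal sketch of the idea: before embedding each chunk, ask an LLM to prepend a short chunk-situating context. `llm` and `embed_and_index` are stand-ins for whatever model and vector store you use, and the prompt is a paraphrase of the published recipe, not an exact quote.

```python
CONTEXT_PROMPT = """<document>{doc}</document>
Here is a chunk from the document:
<chunk>{chunk}</chunk>
Give a short context situating this chunk within the whole document,
to improve search retrieval of the chunk. Answer with the context only."""

def contextualize(doc: str, chunks: list[str], llm) -> list[str]:
    """Prepend an LLM-written situating context to every chunk before indexing."""
    out = []
    for chunk in chunks:
        ctx = llm(CONTEXT_PROMPT.format(doc=doc, chunk=chunk))
        out.append(f"{ctx}\n{chunk}")   # embed/BM25-index this; return the original
    return out

# Usage sketch: embed_and_index(contextualize(doc, chunks, llm))
# e.g. "This chunk is from ACME's Q2 filing; Q1 revenue was $314M."
#      + "The company's revenue grew by 3% over the previous quarter."
```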
Now we come to a comparison of learning methods: the external knowledge base versus built-in attention. Built-in attention means long context, the in-context learning we just discussed. A rough analogy: using an external knowledge base is like getting a language model that does not natively think to reason step by step through a chain-of-thought prompt, while built-in attention is like a model that natively supports long context. Why this analogy? Because an external knowledge base is built on vector databases and file-system indexes, while built-in attention has query-key-value matching inside it, queries matched against keys, so it is a built-in kind of vector database, a built-in knowledge-retrieval process. In theory, the end-to-end optimization of long-context in-context learning has the higher potential effectiveness.
But RAG has real pros. It is not at all bad, because we can spend additional computation on summarization, and the summarization is very important. We don't simply push the original transcripts or agent execution trajectories into the vector database: we have the summarizations discussed before, including agent-autonomous summarization, and we have contextualized embeddings and contextualized indexes. This transformation using additional computational power matters. Also, during summarization we control the prompts, and the prompts can flexibly incorporate the industry know-how of the specialized area. So RAG may sometimes perform better than long context.
Now let me also compare RAG with fine-tuning.
This is based on the paper "Fine-Tuning or Retrieval?". Its core insight is that retrieval-augmented generation is not only more effective but also avoids the knowledge-forgetting problems that fine-tuning can cause.
RAG is very good at retrieving accurate factual knowledge, while fine-tuning is not very effective at memorizing accurate factual knowledge. RAG is also very good at handling new domains, because it is a general technique, while fine-tuning requires data augmentation: you must prepare multiple phrasings of the same knowledge and retrain, otherwise the model cannot remember it.
So in terms of generalization and learning efficiency, RAG excels. But sometimes fine-tuning works better: some tasks involve things that are hard to represent in text, like how to ride a bicycle, how to speak like a human, how to speak more naturally, how to speak with high EQ. These things are very hard to put in a text document, so they are better suited to fine-tuning or reinforcement learning, the first paradigm of learning, which updates the model parameters through gradients.
RAG may also enable large-scale knowledge summarization through large language models. We used to have traditional knowledge bases and expert systems, but those expert systems were not very scalable, because the knowledge was fragmented and querying was inefficient. With large language models, you have a very capable worker that can read long context and extract the knowledge from it, because humans are not good at reading long context. If you want to read a novel of 100,000 tokens, it takes you a very long time; but if you feed those 100,000 tokens into a SOTA reasoning model, it can answer any question about the whole book in about a minute.
So large language models are actually more capable of extracting the knowledge of a specialized domain into a compact representation.
Next, tool generation is another way to do externalized knowledge injection. The first way we talked about was the knowledge base; the second is to enable the agent to write code and generate tools, to achieve self-evolution.
Here we have an idea from the Alita paper: "minimal predefinition, maximal self-evolution". On the GAIA benchmark of general agents, it outperforms many complexly designed agent systems and ranked number one on the benchmark's validation set.
The core idea is very simple: search GitHub or other code repositories for open-source tools, download them and try to run them; and also use the code-generation capability to write tools itself. So we get tools from open source, plus tools written by the agent itself.
But using these self-generated tools is hard, because we quickly accumulate a large number of tools. For example, we may have generated 10,000 tools, or we may have connected to an MCP server (MCP means Model Context Protocol) with a lot of tools.
We cannot put all of that into the context. For example, the complete tool set of the GitHub MCP would require more than 200,000 tokens; the context explodes.
So if we let the agent do passive selection over all these tools, it will be very ineffective.
There is a paper called MCP-Zero, which performs active tool discovery for autonomous large-language-model agents. It converts the passive tool directory into active lookup: we let the agent actively identify its capability gaps and request tools on demand. We do not put all the tool descriptions and parameter descriptions into the model's context. Instead, the agent first selects the server, which scopes the platform or domain, then loads the tool list, then selects the tool. The tools can have a hierarchical organization or a similarity-search-based organization, so the model can search the tools, load the chosen ones, and finally execute them.
That is how we handle tool scaling.
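A minimal sketch of this hierarchical, on-demand discovery; the registry layout and the `embed` ranking are illustrative stand-ins, not the paper's exact protocol.

```python
import numpy as np

def embed(text: str) -> np.ndarray:             # stand-in embedding model
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=32)

REGISTRY = {   # server -> {tool name: one-line description}; only this is indexed
    "github":   {"create_issue": "open a new issue", "search_code": "search repos"},
    "calendar": {"add_event": "create a calendar event"},
}

def discover(capability_gap: str, top_k: int = 1):
    """Rank servers by description similarity, then tools within the best server.
    Full parameter schemas are loaded only for the finally selected tool."""
    def rank(items):
        q = embed(capability_gap)
        return sorted(items, key=lambda kv: -float(embed(kv[1]) @ q))[:top_k]

    server_blurbs = {s: " ".join(t.values()) for s, t in REGISTRY.items()}
    (server, _), = rank(server_blurbs.items())
    (tool, desc), = rank(REGISTRY[server].items())
    return server, tool, desc      # next: load the schema, then call the tool

print(discover("file a bug report on our repository"))
```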
In the next part, let me give three use cases of agent self-evolution through tool generation. The first case is generating intelligent RPA for computer use. We know computer use suffers from low speed and high cost; we talked about that in the first part of the talk. Traditional RPA also has strict limitations: it is a rigid script, it cannot adapt to dynamic interfaces or detect changes to websites, and most importantly it requires a human to write it, which is very cost-ineffective.
To achieve this, we do element localization to generate RPA code. Instead of using the language model to analyze a screenshot and produce an action each time, we let the LLM generate a trajectory that records the element clicked in each step.
To handle dynamic content, we distinguish between fixed processes and dynamic judgments, and we only extract the operation sequences that are automatable.
In addition, we have several engineering tricks. For example, to deal with uncertain response times, we use event-driven execution: we listen for the event-completion states rather than sleeping for a fixed time.
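A sketch of one such event-driven step with a fallback, assuming Playwright on the browser side; the `vision_agent_fallback` hook is a hypothetical stand-in for handing the step back to the vision model.

```python
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

def run_step(page, step, vision_agent_fallback):
    """One generated RPA step: wait for the element event instead of sleeping;
    if the page changed and the selector never appears, fall back to the VLM."""
    try:
        page.wait_for_selector(step["selector"],
                               timeout=step.get("timeout_ms", 8000))
        page.click(step["selector"])
    except PWTimeout:
        # Website layout probably changed -> let the vision model drive this
        # step and flag the RPA script for regeneration.
        vision_agent_fallback(page, step)

# Usage sketch:
# with sync_playwright() as p:
#     page = p.chromium.launch().new_page()
#     page.goto("https://example.com")
#     run_step(page, {"selector": "#submit"}, vision_agent_fallback=...)
```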
In this way we can generate very effective RPA programs, and the RPA program for each website effectively notices when the website changes: when the expected response does not arrive within a reasonable timeout, it falls back to the language model. During the RPA process it can also invoke the language model to look at the current step: some steps are pure RPA steps, fully automated, and the remaining steps are still handled by the vision language model to cope with dynamic content.
For example, checking the weather: a traditional, fully agentic screenshot-driven process needs nine LLM calls and 47 seconds; after acceleration, it takes only 10 seconds. Booking a flight on the official website is also about five times faster. You can see there are four RPA segments here: RPA accelerates the fixed parts, such as filling the forms, while navigating the website and understanding the flights to select the best candidate is still handled by the main language model. So we accelerate the critical path and the repeated steps while keeping the flexibility to handle dynamic content.
Now we come to the second use case: visualizing and parsing agentic logs. We have execution trajectories for these agents, which include the tool calls and tool responses, plus the communication with users as user/assistant messages. All of this needs to be visualized as a trajectory, as a conversation history, for better understanding and debugging.
But we have many tools. A SOTA agent may have tens or even hundreds of tools, and they often have sub-agents that invoke other agents. For example, our orchestration agent invokes a deep-research agent, a phone-call agent, and a computer-use agent. These sub-agents are very diverse: they have very different parameter formats, the tool-response formats often differ too, and because many developers are building new code, the formats are constantly changing.
So if you have, say, a web front-end engineer hand-writing parsers for all these logs and doing the visualization, it is very fragile and very hard to keep pace with the development of the agents and tools.
So we take the self-evolution approach. We let the front end automatically report parsing failures to the agent, and on a parsing failure we generate new parsing code based on the failing example. Then we need to verify that code. Because it concerns the user interface, the GUI, we use a virtual browser to check that the new parsing code parses correctly, and, more importantly, we use a vision language model to check whether the visualization matches our expectations given the original logs. The vision language models are very effective here: they take the original log, your code, and a screenshot of the rendered visualization, and determine whether the implementation is correct. If it is, the parsing code is automatically updated.
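A sketch of that repair loop; every helper here (`generate_parser`, `render_in_browser`, `vlm_check`) is a hypothetical stand-in for the LLM call, the headless-browser render, and the vision-model judgment.

```python
def repair_parser(failing_log: str, old_parser_src: str,
                  generate_parser, render_in_browser, vlm_check,
                  max_rounds: int = 3):
    """Regenerate the log parser until the rendered result passes a VLM review."""
    parser_src = old_parser_src
    for _ in range(max_rounds):
        parser_src = generate_parser(failing_log, parser_src)  # LLM writes code
        try:
            screenshot = render_in_browser(parser_src, failing_log)
        except Exception as err:              # code crashed -> feed error back
            failing_log += f"\n[render error: {err}]"
            continue
        if vlm_check(original_log=failing_log, screenshot=screenshot):
            return parser_src                 # verified: deploy automatically
    return None                               # give up -> page a human
```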
Then we come to the third use case, also a kind of agent self-evolution: automatic diagnosis of production-system issues. We run a production agent system with hundreds and thousands of cases online, and manually inspecting the problems inside them is simply not possible.
So we use the execution trajectories we already have, and we use the language model, taking the system architecture documents and product requirement documents as input, to automatically analyze the execution flow and then generate a regression test case for each identified problem. For the important items, it can read the Scrum dashboard, create a work item, create the test cases, and create the session, meaning the agent's working session, for manual inspection by the developer.
In theory we can go one step further: from the test case, the work item, and the original problem report, we can create a pull request, and even review the pull request and run the test cases fully automatically. We haven't done that yet; we still rely on humans using Cursor, Claude Code, and similar agentic coding tools to work on these attempts. But I think one day it will come true: you let the agent do the self-evolution, localize the problems from its own execution logs, generate the test cases and work items, pin down the problem details, then generate code accordingly and run the regression tests. If the regression tests pass, the agent's code modification goes online automatically.
If we can make this come true, the agent can really self-evolve, and humans will only need to instruct the process and do the necessary reviews of the pull requests, to maintain code quality and design.
Finally we come to the summary: from post-training to in-context learning to externalized learning.
In this second part, about agents learning from experience, we covered the first paradigm, post-training, which performs parameter updates through reinforcement learning or supervised fine-tuning.
The second paradigm is in-context learning, which in theory is a soft update: the attention mechanism at inference time is equivalent to a low-rank update to the parameters. In-context learning has the pros of fast adaptation and no training, but it is not good at representing knowledge that is hard to express in text.
The third paradigm is externalized learning, with the knowledge base and tool generation. For the knowledge base we have text-based representations: summaries, raw context, tree-based summaries, and agent-autonomous summaries. For tool generation, we codify the processes into tools to achieve efficient and reliable automation.
Many people will ask: why do the three paradigms coexist? Because they suit different purposes.
The first, post-training, is better at representing information that is hard to put into text. For example, how to speak like a human, as we saw in the first, real-time part of the talk: the VLA model for GUI operation, the TTS and ASR models, the language model combining fast thinking and slow thinking, doing thinking-while-listening and thinking-while-speaking. These things are better internalized into the model parameters through post-training. This also guarantees latency: you get very low latency without walking through a checklist in the prompt item by item.
In-context learning adapts very fast and is very good for factual data, because factual data is more accurate when kept in context than when trained into the parameters.
And externalized learning serves long-term memory, in contrast to in-context learning, through the knowledge base; and you can automate processes via code through tool generation. The LLM then does not need to know the details of each tool; it only needs to know what the tool does, its description. It is a level of abstraction: when you call the tool, or once you call the sub-agent, the sub-agent takes care of all the details, and the main agent reduces its mental burden through this extra layer of abstraction, the tool call.
Externalized learning is also a way to move beyond the limitations of attention. We have parameterized learning by training, and in-context learning, which we have discussed; externalized learning addresses major problems of transformers.
The first is hallucination and reliability: an external knowledge base provides a verifiable, precise source of truth.
It also tackles the inefficient-learning problem. If we train knowledge into the many parameters, the parameters need a lot of time to train, and at inference time we have to go through all of them; even with mixture-of-experts, we go through a large portion of the experts. That is still a very inefficient way to do retrieval. But with a knowledge base or a tool set, we have the technologies we discussed for navigating large tool sets and large knowledge bases very efficiently.
In the last two slides I want to talk a little about the scaling law. The first slide is about going from pre-training to reinforcement learning. As we know, the bitter lesson from Richard Sutton says, in one line, that we want AI agents that can discover like we can, not ones that merely contain what we have discovered; building in our discoveries only makes it harder to see how the discovering process can be done. Pre-training is based on next-token prediction, and in phase two we have reinforcement learning, which is training through interaction with the world.
So phase one gives us AI agents that contain what we have discovered, and in phase two reinforcement learning gives us agents that can discover like we can. Phase-two agents move the model beyond passive learning to actively exploring the world: interacting with humans, with the internet and websites and computers through graphical user interfaces, and, in the next step, with the physical world, for example controlling robots and autonomous vehicles. For interacting with humans, the first step is real-time voice. This learning method lets us build a second curve of the scaling law, learning from the successes and failures of interactions.
However, both phases, pre-training and reinforcement learning, run into a limitation of transformers, a bottleneck of preciseness: transformers suffer from hallucinations and information confusion. If you want to precisely memorize dynamic, specific details, for example about each person, each business, or each region, it is very hard.
So the future path may be externalized learning, which may continue the scaling law. In the bitter lesson article, Rich Sutton has one sentence: there are two methods that seem to scale arbitrarily, search and learning.
Externalized learning is exactly search and learning. The learning part is to learn and summarize the interaction experiences into knowledge and code: code is tools, and knowledge is text represented in the knowledge base through summarization.
And search corresponds to searching the external knowledge bases and the tool libraries, whether through agentic keyword search, through file-system navigation, or through vector search. All of these are highly scalable, because we have 20 years of experience building highly efficient, highly scalable search systems.
In this way we can have a precise and structured knowledge base, using the SOTA large language models to summarize all these experiences into structured knowledge.
This is very important because, before large language models, this kind of high-quality summarization required expensive domain experts. There were a lot of expert systems decades ago, based on human-written rules, and they did not scale, simply because there were not that many domain experts. But today, with large language models, there is effectively an infinite supply of domain experts.
Actually, if you look at the NLP field, the only traditional NLP topic that still survives in the large-language-model era is summarization. Named-entity extraction, for example, is already obsolete because we do not need it anymore; but summarization is still alive, and it may stay alive for several more decades.
Apart from the knowledge base, we also have code as a universal knowledge representation. For example, we can do summarization into a code representation to ensure structure and high quality, and we can generate tools to represent the procedural knowledge in our experience.
Code is not just a tool for programmers; it is a very important, universal capability of SOTA language models. It is a universal structured data representation: precise, verifiable, and composable. That is why it is so important to use code as a knowledge representation.
So in a future system we expect to see the co-evolution of large language models with externalized learning, including the knowledge base and tools: the external knowledge base manages the precise and dynamic knowledge, while the model itself provides the general reasoning and understanding capabilities, plus perhaps some domain know-how, rules that do need to be internalized into the model parameters.
Finally, let me talk about Pine AI.
Pine AI is our company, and we are looking for full-stack engineers who can build SOTA AI agents.
Our philosophy is that everyone's contribution to the company's valuation should be in the tens of millions of dollars. You can see this at the best AI companies, including OpenAI, Anthropic, and Google, and also at good agent companies like Cursor: each carries a valuation of more than ten million dollars per person. We believe that in the era of AI-assisted programming, everyone can instruct the AI much like instructing ordinary engineers, so we no longer need engineers whose capability is below the SOTA language models.
Actually, more than 80% of the code in our system is already written through coding agents, for example Cursor, Claude Code, and so on. And we expect everyone to love solving problems hands-on. There is a famous quote by Linus: "Talk is cheap, show me the code." Some people doing vibe coding say the quote should be flipped: "Code is cheap, show me the talk." But no matter which you use, talk or code, you need to solve problems hands-on, because everyone will be a combination of architect and product manager.
You need to do the architect's work: design the technical architecture of the system, instruct the coding agent to finish the job, and review the code the coding agent produces. You also need to understand the product: to know what to build, what is good, and what is bad. Evaluation is very important here. Everyone should have a good sense of what is good and what is bad; traditionally that was only the product manager's job, but now every developer needs this capability.
We also need solid software engineering skills and an understanding of first principles, so we know what kinds of things are best delegated to agents and what kind of agent architecture is best. I think using coding agents fluently is very important, because coding agents are among the most successful agents this year. If you are good at using coding agents, you know the capability boundaries of the agents and the language models: you know what they can and cannot do, so you can make the best use of their capabilities. You also know how to write the prompts, how to write the cursor rules, how to organize and manage the context, when to summarize, when to start a new context, and so forth. This is very important.
Our mission is to truly solve users' troubles and hassles by building the leading voice agents and interactive agents: agents that interact with the world in real time and really learn from experience in an effective, low-cost way.
Finally we come to the last slide. There are two clouds over our agents: the first cloud is real-time interaction, and the second cloud is learning from experience.
I want to end this talk with the famous quote from the bitter lesson: "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
For the real-time interaction part, we propose a new architecture that better leverages computation to do continuous thinking while listening and speaking, with a perception layer and an execution layer that work in conjunction with the continuously thinking main language model to produce a fully interactive, real-time streaming experience.
And for the learning-from-experience part, we propose three paradigms. The first paradigm is post-training of the models, by parameter update. The second is in-context learning, through implicit parameter updates. And the third is externalized learning, using an external knowledge base and written tools to let the agent evolve itself. In learning from experience, and especially in externalized learning, the knowledge base and the tool set are general methods that can leverage a lot of computation, through summarization and code generation, which is ultimately the most effective way of keeping the scaling law going into the next decade. Thank you so much for listening.