
Cognitive Robotics and HRI: The Importance of Starting Small

By UK-HRI

Summary

Key takeaways

  • **78% of adult language is abstract words**: A study cited by the speaker found that 78% of the language used by adults consists of abstract words—terms where the mapping between the word and its meaning in the physical world is not direct. This creates a fundamental challenge for robots using LLMs to understand human instructions like "make me a cup of tea." [11:04], [11:24]

  • **LLMs don't truly understand—the Chinese Room argument**: The speaker argues that LLMs, like the person in Searle's Chinese Room thought experiment, manipulate symbols without understanding. When you match input symbols to rules and output symbols, you appear to speak Chinese but have no understanding—mirroring how LLMs generate text without comprehension. [17:37], [17:44]

  • **LLMs should be called "large text models"**: The speaker contends that "large language model" is a misnomer because language is more than text. True language includes semantics, pragmatics, and understanding within context—none of which LLMs possess. They should be called "large token models" or "text models." [37:26], [37:43]

  • **LLMs inherit the dictionary's circular definitions**: The speaker demonstrates that pure symbolic systems create vicious circles: "push" is defined as "press forcefully," "force" becomes "energy or strength," and "energy" loops back to "strength or force"—a merry-go-round that LLMs replicate in their token-based representations. [14:51], [15:17]

  • **Child data vs. LLM training scale is incomparable**: A 10-year-old child has experienced roughly 3×10⁷ to 10⁸ tokens over their lifetime, while LLMs train on 10¹²+ tokens. Despite this massive data advantage, children are more competent in real understanding and language use—a paradox that challenges the scaling approach. [35:35], [36:16]

  • **Developmental approach: "start small" for humanlike robots**: Referencing Jeff Elman's 1993 work, the speaker argues that small networks can learn complex structures through incremental training—starting with simple sentences, then progressing to complex nested ones. Maturation mechanisms (like growing memory capacity over time) enable learning that wouldn't occur with simultaneous exposure to all complexity. [41:48], [43:34]
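The scale gap in the child-versus-LLM takeaway is easy to make concrete with a little arithmetic, using only the figures quoted in the talk:

```python
# Back-of-envelope comparison of language exposure, using the figures
# quoted in the talk: a 10-year-old has heard roughly 3e7 to 1e8 tokens,
# while LLMs are trained on 1e12 tokens or more.
child_tokens_low, child_tokens_high = 3e7, 1e8
llm_tokens = 1e12

print(f"LLM / child (high estimate): {llm_tokens / child_tokens_high:,.0f}x")
print(f"LLM / child (low estimate):  {llm_tokens / child_tokens_low:,.0f}x")
# Even on the generous estimate, the LLM sees at least four orders of
# magnitude more text than the child.
```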

Topics Covered

  • The 80% Problem: Why Robots Can't Grasp Abstract Language
  • The Chinese Room Shows LLMs Have No Real Understanding
  • LLMs Are Text Models, Not Language Models
  • Why Transformers Will Never Achieve True AGI
  • Developmental Robotics: Start Small to Build Real Understanding

Full Transcript

Hello, and welcome to the third event of the UK-HRI expert seminar series on methods and best practices in human-robot interaction research. I'm Claire Asher, freelance science communicator and host of the Robot Talk podcast, and I'm excited to be your host for today's seminar. We'd like it to be a really friendly, welcoming environment, so please say hello in the chat and let us know where you're watching from today.

This afternoon we're very lucky to be joined by Professor Angelo Cangelosi from the University of Manchester, who's going to be talking about cognitive developmental robotics. Before I give the floor to Angelo, I'm going to hand over to Dr. Patrick Holthaus, who leads the UK-HRI topic group, to give us a very brief overview of what the group is all about. Over to you, Patrick.

Thank you, Claire. Yeah, welcome everybody to Angelo's very exciting talk today. I'm going to quickly say something about the UK-HRI topic group, which is a funded initiative of the UK-RAS Network; it basically receives money from UKRI to disseminate best practices and methods between human-robot interaction researchers. This initiative is led by me at the University of Hertfordshire, in collaboration with Patricia Shaw at Aberystwyth University and Daniel Hernández García at Heriot-Watt University. We are thankful for getting that funding and for providing this seminar to you.

Just a few quick words about how we are organised. As I said, we try to disseminate best practices amongst HRI researchers, provide guidance for newcomers to the field, and also provide a mapping of what the UK has to offer in terms of HRI research. We offer an advisory group where people can engage with us: we have created a network of now more than 25 advisers from 179 institutions, where people can apply for mentorship and advice. We have three mentorships going on already, so it's a really vivid network where we can get to know other people. The second leg that we offer is the research experience grants, where you can come as an emerging researcher and do a short-term visit to gain experience at a different university. Go to our website and see if you can apply for these if you're interested in a short stay, for example at our lab or at other labs, to see how HRI research works and gain some experience there.

Thanks very much and enjoy the seminar.

Over to you, Angelo.

So, I'd like to start by thanking, of course, the organisers, Claire and Patrick; and it's nice to also be indirectly connected with, and co-organising this event with, Patricia and Daniel in Edinburgh.

This talk has really two parts. It focuses, of course, on my research and builds on my experience, and it will then discuss three challenges in HRI for future work. The first two are directly related to my work on language learning, which is my core research: language learning in robots, which is necessary for communication between robots and people. The third will be a more general discussion; I hope it will really start a discussion between people who are pro or against LLMs in the field of HRI, because it's a big debate, it's something that is with us, and we need to reflect on what they are, what they do, how we can use them, and also how we shouldn't use them.

So what's the context of my research? It is really the context of a lot of HRI areas. I primarily look at social robots as companions for older people; I've done this in some European projects, and also as part of my ERC project and other UK-funded projects. More recently I started to work on robots as tutors for children, for language learning and also for mathematics and number concepts; we'll talk about this later. And of course cobots; you see them there on the top right, it's an important area. The bottom highlights not only the aspects of linguistic communication but also tactile and non-verbal communication, which is important; this is part of a tactile system, and I'm collaborating with a company in China that develops this for humanoid robots. Now, let's start from an old video

from a project that was led by Paolo Dario and Filippo Cavallo a few years ago. We were partners when I used to be in Plymouth, and this was a collection of three robots, all working with older people. So let's look at the video; it's not the video of the results, but the video of the vision that helped us develop the project. I'm going to skip ahead; it's a long video, so I'm going to focus on two or three steps.

The project is called Robot-Era. There are three robots. One is called DORO, the domestic robot. This is the robot that lives in your own flat or apartment. It will support you by doing some cooking; it can make a cup of tea (we'll go back to the cup of tea in a moment). And you can also talk to the robot, using speech or using a tablet, to order your food or your medicines from a site. The distinctive value of this project was to have three robots collaborating together to support older people. So there is the domestic robot DORO, and then you have the outdoor robot, which is a robot that works outside. Here it is. It will do the shopping for you: you ask your domestic robot, and the outdoor robot will go and get the shopping, walking around the town of Peccioli, where we did the experiment. Then the condominium robot will collect the shopping from this outdoor robot, go to the flat, and give the food or the medicine to the indoor domestic robot. Right? So the distinctive value of this project was collaboration.

And, very importantly, this was the vision video; the results were more limited than what I have shown here. This is the ideal scenario. More importantly, we decided, together with the users (the users were the older people, the future users, but also carers, doctors in geriatric hospitals, and families), on 11 tasks on which to focus; of course, we can't expect the robots to do everything. The reason I'm showing this video is that I would like to focus on a simple request that we can give to a robot, and on the solution that systems like LLMs and VLAs (vision-language-action models) now give us to use these robots in a more efficient way.

So: although I'm Italian, I moved to the UK almost 30 years ago, and the reason they sent me away from Italy is that I don't like

coffee. I'm one of the few Italians who doesn't like coffee, so I'm not allowed to be there. I like tea, and they sent me to Britain, which is meant to be the tea country, which it is not, really. I'm sorry for my British fellow academics, but the really good tea comes from Asia and many other countries.

But regardless of this: I like tea, and I'm getting older, so maybe one day I will be a user in one of these projects, and I would like the robot to make me a cup of tea. So if I, like the user in the video, ask this robot, let's say a Pepper robot or the G5, to make me a cup of tea, the robot can rely on an LLM or a VLA to help with the task. If I ask a robot which relies only on an LLM, a pure text-based system, this is what I will get. This is actually the prompt that I used on a ChatGPT version maybe a year ago: "make me a cup of tea", and I get a recipe, right? I get

instructions. Of course, if you connect this LLM to perceptual systems that see where objects are, where the teapot is and so on, and to action primitives, you can in principle get them to act, which is what people are doing. But the core of this, almost like a thought experiment (it's not really an experiment), is to reflect on the fact that the robot is giving me a recipe for me to make the tea; or, as you will see in a moment, it will try to solve the problem by mapping each of these words onto some actions and some perceptual properties. But this is not easy. Why?

Let's go now from pure text, which really isn't enough for a VLA or for a physical embodied robot acting in the real world in my apartment, towards multimodal training stimuli, multimodal examples. Again, about a year ago I googled "making a cup of tea", and I was lucky enough to find that there is a recipe from the British Standards Institution that tells us what the right way to make a cup of tea is, at least in the UK. And you see here there are words, which again are very closely linked to what an LLM will give us. There are images; they are important. The robot needs to see the pot, see the tea leaves, recognise the water changing colour when the tea is ready, and so on. This is all very good. And there is a symbol here which is about time, remember.

But what I want to focus on here is the fact that about 80% (80% is an approximate number) of the words in the instructions, and actually some of the images themselves, are problematic. We know this is a timer, right? But how do you tell the robot this is a timer? Also, how do you tell the robot what time is? This is a non-trivial concept.

So, the highlighted words are words which we call non-concrete, or abstract, words. "Replace", for instance, is a motor word, but "replace" applies to many different actions. The mapping, if I say "cup" or if I say "push", is relatively straightforward: there is only one type of action, or one representation, for a cup or for the pot, with some variations. But "replace" is hard, and so are "the article" and "leave to brew". You see, 80% of the words. Why do I say 80%? Because in a study on children and adults, 78% of the language used by adults is actually abstract words: words where the mapping between the word itself and the meaning in the physical world is not direct. And we are also going to talk about the internal world: "sensorimotor" means the external physical environment, but also internal states. If I feel sad, there are internal states which I also recognise and perceive in my own way.

But the challenge here is that if we want to use these LLMs and link them to robots, there are non-trivial issues, like for example teaching a robot the concepts of one, two and three. Because if I say "for five minutes", what's five? What's 1.5 grams? How do I know what 1.5 is, in terms of a robot? So what's the solution? The solution

nowadays, in robot learning and robot control, seems to be to create a vision-language-action model. There is one extreme example: the early Google RT (Robotics Transformer) solution, where you have a system that transforms an image into words. The words go nicely, as a prompt, to an LLM. It asks for a recipe, which we saw before, and then ideally the output should really map onto physical objects that you can manipulate, or onto primitive actions that you can perform. But we said that there are actions, like "replace", which are non-trivial to represent, to learn and to execute.
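The pipeline just described can be sketched in a few lines. This is a toy illustration, not Google RT's actual code; `caption_image`, `ask_llm` and the primitive names are all invented stand-ins:

```python
# Minimal sketch of the "image -> words -> LLM plan -> motor primitives"
# pipeline described above. Everything here is a hypothetical stand-in.

PRIMITIVES = {"move_to", "grasp", "pour", "wait"}

def caption_image(image) -> str:
    # stand-in for a vision model that turns pixels into words
    return "a kettle, a teapot and a cup on the counter"

def ask_llm(prompt: str) -> list[str]:
    # stand-in for an LLM returning a step-by-step plan as text
    return ["move_to kettle", "grasp kettle", "pour kettle teapot",
            "replace lid", "leave to brew"]

def ground(steps):
    """Map plan steps onto known primitives; steps with no direct
    sensorimotor mapping ('replace', 'leave to brew') fall through,
    which is exactly the abstract-word problem the talk raises."""
    grounded, ungrounded = [], []
    for step in steps:
        verb, *args = step.split()
        (grounded if verb in PRIMITIVES else ungrounded).append((verb, args))
    return grounded, ungrounded

plan = ask_llm(f"I see {caption_image(None)}. Make me a cup of tea.")
grounded, ungrounded = ground(plan)
print("executable:", [v for v, _ in grounded])    # move_to, grasp, pour
print("no mapping:", [v for v, _ in ungrounded])  # replace, leave
```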

But let me also reflect on the fact that ChatGPT appeared in 2022, right? Four years ago, or three and a half years ago: the big revolution in the field of AI and machine learning. But in 1966, so nearly 60 years ago, Weizenbaum at MIT developed ELIZA, the first chatbot, of course nothing like the complexity of today's LLMs. So let's also reflect on moments in history which are important, because today, in the second part of the talk, I'm going to go back to some work from the '90s which I think is very insightful for our future research in AI, robotics and HRI.

So, if I pre-train a robot with a GPT, with a transformer, or if I just use a dictionary (this is an example from a few years ago; it uses a dictionary rather than training, but you can imagine that an LLM can easily do the same): I ask my robot what the meaning of "push" is. The robot uses an LLM; it searches some past memory of its reading of the Webster's American English dictionary, and it will tell me: no problem, I know what "push" is. So the robot will answer that "push" is "to press forcefully against, to move", which is a relatively high-level definition, but that's what we understand and use, right? So it seems that the robot really knows about pushing, because we know pushing is to apply some force to move an object, and this is what the robot says by using a dictionary definition.

But now let's challenge the robot's understanding of this concept of "push" by asking, for example, what "force" means. Here "force" and "push" are chosen on purpose because they are very concrete: "push" is a primitive that the robot can apply with a hand (a motor with just one degree of freedom can push an object on a table), robots have motors, and we have force sensors, so it should be straightforward for the robot to "understand", in quotes, these words. But here it's all a symbolic system. This is called amodal: an amodal representation is based just on symbols, or text, which is what an LLM uses. So I ask for a definition of "force", and because of this dictionary-LLM solution I get "energy or strength", which is good; we understand this, it makes sense. Now, what's "energy"? And "energy" is defined as "strength or force", which, you see, is a bit of an issue. So "force" is defined by "energy", and then "energy" is defined by "force". And if you continue in this direction, you can see all these vicious circles, called the merry-go-round by Stevan Harnad; you see an image of a merry-go-round ride there at the top right. You start, and you end up where you started. So it's a salad of words, right? Which is what maybe LLMs do, or maybe not; we can debate this later.
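The merry-go-round is easy to reproduce with a toy dictionary; the entries below loosely paraphrase the push/force/energy chain from the talk:

```python
# A toy "amodal" dictionary: every definition points only at other
# symbols, so following definitions loops without ever touching the world.
DEFS = {
    "push": "press",
    "press": "force",
    "force": "energy",
    "energy": "strength",
    "strength": "force",   # ...and we are back where we started
}

def follow(word: str) -> list[str]:
    """Chase definitions until a word repeats: the vicious circle."""
    chain = [word]
    while chain.count(chain[-1]) == 1 and chain[-1] in DEFS:
        chain.append(DEFS[chain[-1]])
    return chain

print(follow("push"))
# ['push', 'press', 'force', 'energy', 'strength', 'force']
```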

So I hope this shows you that there is a bit of an issue if we rely on a purely amodal symbolic system, which is what a text-based, or we should say token-based, LLM is.

Now, this is also related to a philosopher's reflection: the Chinese Room thought experiment. So let's watch the video; if you have seen my talks, you'll recognise it. For those who are new, this video nicely describes the Chinese Room experiment.

In the 1980s, the philosopher John Searle was chewing on this problem, and he came up with a thought experiment that gets right at the heart of it. He called this the Chinese Room. The experiment goes like this. I'm locked in a room. Outside, there's someone who only communicates in Chinese. She writes out some questions and then posts them to me in the room. Now, I don't speak Chinese, but I do have these books, and they give me instructions on exactly what to do with these symbols. So I look in the book, and if I can find a match to the symbols, then the book tells me exactly how to respond. So I can look up this response that matches.

I'm stopping this here: let's call this a prompt, right? You get a prompt from the lady, in writing, you match it to your memory, and then you are going to give a prediction of the next words.

So now I can post this as the reply to the message I received. When our Chinese speaker receives the message, it makes perfect sense to her. As far as she's concerned, we're having a conversation in her language. Just by following a set of instructions, I can convince somebody on the outside that I speak Chinese. And if I have a large enough set of response books, I can have a conversation about anything. But here's the important part: I, the operator, do not understand Chinese.

So, very important: David, the operator, does not understand Chinese. He also says that if he has a big enough set of examples and words and dictionaries, which is what an LLM has (a huge dataset), he can create a conversation, which is what we get. But we accept as true the fact that David does not understand any of the linguistic interaction happening between the prompt from the Chinese-speaking lady and the answers that she gets, which are meaningful to her but not to David.

Now, if you are not convinced that David doesn't understand, I'm going to do this exercise. If you have seen this, ignore me. But for those who are new to this kind of talk: there is a language that I know which is not Italian. I'm going to ask you the question here at the top, and to do this I'm going to pre-train you quickly, almost few-shot learning, let's say, with a dictionary. So there are words; I created a story, right, with words like the ones for push, energy and strength: "picciotta" and so on. You see, some words are repeated, some change. Then there is a book, like in the video, with two pages; each time you browse the two pages, you get a different question. So here, to simplify the exercise, let's do only two possible questions.

So I'm going to ask you; please, let's see if I can see the chat. Let's try to type in the chat the answer to my question at the top. I'll help you by highlighting what I'm doing, right? Let's see if anybody is replying.

"Ai", very good. Can we change "a2" to something else? This number doesn't mean anything to me. So let's answer properly, in the sense: let's not cheat if you know some of the words; let's try to speak this strange language. Excellent. So, very quickly: Christina, you have an Italian name, so you are cheating a bit, but it's fine. Shang: okay, that's okay, you are not an Italian speaker. It's not Italian, but it is similar to Italian. Okay, but very good. Okay, Christina was right.

So what is Christina doing? What I'm doing here, remember, is symbol mapping: question, prompt. You see, there is a matching there, so I'm matching them. X is a variable that can change; in this case X is "potta". So I highlight the question; I find page two in my book; I highlight in my brain, in my transformer if you like, the word "picciotta", because this has been activated; but you will see that later I need to activate also the word "ani". And now I start to build the answer: "la X a..."; X is "potta", so "la picciotta ai". Excellent. Now we need to continue, substituting two more words: A and B are variables; they are there on purpose to make you work a bit. That's why, when I got the reply, "ai" was a correct answer.

But let's go a bit deeper. How do I do "a ani"? I activate "an"; I go into my memory, my transformer attention mechanism, and I find "an" twice, right? So what is the context that most strongly activates the word "an" closest to this? It's clearly the word in the middle, right: "seti". So "seti" has a high co-activation in my attention transformer. So the answer would be "a seti", because "an" and "picciotta" are co-activated.

So what we did is just pure words, meaningless words. Again, if you know Italian, maybe 50% of the words are similar; if you don't know Italian, then maybe you know Spanish. But let's pretend you are Chinese, like my postdoc Shanguang, with very little knowledge of Italian: ignore the meaning (this is the whole point) and talk to me in Sicilian.
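The exercise can be caricatured in code: a "room operator" that answers by pattern matching and co-occurrence alone. The corpus, the rule, and every token except "picciotta" are invented stand-ins, and the program has no access to meaning anywhere, which is the point:

```python
# A toy "room operator": it answers by symbol matching and a crude
# co-occurrence "attention", never by meaning. All tokens are stand-ins
# for the talk's Sicilian exercise.
from collections import Counter

CORPUS = "la picciotta an seti an seti manciari".split()

def most_coactivated(word: str) -> str:
    """Pick the token that most often appears next to `word`:
    a crude stand-in for transformer attention."""
    neighbours = Counter()
    for i, w in enumerate(CORPUS):
        if w == word:
            if i > 0:
                neighbours[CORPUS[i - 1]] += 1
            if i + 1 < len(CORPUS):
                neighbours[CORPUS[i + 1]] += 1
    return neighbours.most_common(1)[0][0]

def answer(x: str) -> str:
    """Rule from the 'book': reply 'la X a <word co-activated with an>'."""
    return f"la {x} a {most_coactivated('an')}"

print(answer("picciotta"))  # -> la picciotta a seti
```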

So let's reflect on this. What I've shown you is that, up to now, you are the robot and I'm the person, or vice versa: there is really no understanding, okay? Because no one knows what "la picciotta" means. So how do we start to help the robot to understand my language? This is actually Sicilian dialect; to be more precise, the dialect from west Sicily, where Palermo is. I'm actually today in Catania, which is in east Sicily, where some of the words are different; "picciotta" is not in the dictionary of the Catania dialect.

So how do I start to get the meaning of words? Let's go to philosophy: Quine. Quine wants to understand how we can build and represent meanings, and learn meanings from our experience with the world. In a famous book and paper he uses the word "gavagai"; he makes another thought experiment. Let's assume we hear the word "gavagai". We go to the Amazon and discover a new tribe which has never been in contact with us English speakers. We hear the word "gavagai", and we need to guess the meaning of "gavagai" in the local language. And if

I were again to... actually, let's do this, because you were very collaborative. Let's go to the chat: please type the possible meanings. This is very important, because the whole point of Quine is that there is no single meaning. The possible meanings of "gavagai": can you please type? I need three or four examples; let's see if you are prototypical. "Feeding": what else? Only "feeding"? "The rabbit": very good. "Food": excellent. I think I got what I wanted, because first I need a "rabbit". The majority of you, if we did some statistics... thank you for the second "rabbit". So, okay, you can stop there. Very good.

So the majority of you said "rabbit", or would have said "rabbit" if I had done a proper study. Why did you do this? Because when we were babies we were using a mechanism, and you are still using it, called shape bias. When two-year-old children learn words, they have a bias, like a cognitive bias, a default option, that says: if I hear a new word, "gavagai", then this must refer to the main shape, the most salient, most visible, most active shape, which is the rabbit. So the majority will first say rabbit, but then of course food, carrot; "girl" is right, but I also like the answer "grass": even a background shape is actually valid.

Now let's continue. We don't know the word yet; we don't know yet what it means. Let's keep the chat open. Now, can you tell me the meaning of "gavagai" in the second experience that we are having in this Amazon forest? What are the possible meanings? Remember, there is more than one. So it's not "rabbit" anymore.

What else is there? Can I have some answers, please? There is no food. "There is a puppy": very good. "Pet": excellent. "Animal": excellent. "Field". And we can still remember the big grass: very good, well done. So Quine says that it's very hard to know if the word refers to the whole animal or to parts of the animal, because I didn't focus on this: it could be the eyes of the animal, right? It could be the ears of the animal there, the dog or the pet.

So let's go back to Sicilian. I'm going to teach you how to speak Sicilian by doing something which we call grounding. So what's the meaning of "picciotta" in Sicilian? Or, I would say, what are the possible meanings? Imagine: most of you will say "girl", which is correct, but it could also be "happy", "young person", "hair", "cheek" (you know, red cheeks) and so on; all these words are correct. The second mechanism, in addition to shape bias, is called cross-situational learning: when I show you the rabbit and the dog, you are comparing across situations.

So let's continue with other Sicilian words. The word for egg is very easy, right? Because of the shape bias there is one shape, we call it the egg shape, the oval shape, so this word means egg. The next one is a bit trickier: it means seeds; maybe it means food; maybe it means, I don't know, two. But what matters here is that the top words are concrete words, where the shape, the word and the meaning are kind of one-to-one correlated, like egg and girl. Now

let's reflect on abstract words. So what's the meaning of "picca"? Again, we don't have time to go to the chat, so: most of you could say "three", or "many eggs" (you know, more than one egg), or "few eggs"; you could say "food", and so on. But let's do cross-situational learning. So it's not "three" anymore; it's not "eggs" anymore; it could still be "food". Now, if I keep doing this, cross-situational learning makes me reflect: it's not "food", because "assai" and "picca" are used for food and also for seeds; they all match. So I guess you can guess: the main difference is "few" versus "many", which is correct. And finally, "manciari": what's "manciari"? It's shown with a Sicilian pizza and with Mr. Bean and his teddy bear. But if I show you the video, the action, you can guess it's about the action; we call it a verb, right?
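Cross-situational learning has a very compact computational core: intersect the candidate meanings of a word across every situation in which it occurs (shape bias would rank the initial candidates). The scenes below are invented to mirror the "picca" (few) example, not real experimental data:

```python
# Sketch of cross-situational word learning: start from all candidate
# meanings in each scene, then keep only what survives every scene.
def cross_situational(observations: list[set]) -> set:
    """Intersect candidate meanings over every situation a word occurs in."""
    candidates = set(observations[0])
    for obs in observations[1:]:
        candidates &= obs
    return candidates

# Each set: plausible meanings of 'picca' in one scene.
scenes = [
    {"three", "eggs", "few", "food"},   # a few eggs
    {"seeds", "few", "food"},           # a few seeds
    {"coins", "few"},                   # a few coins - not food!
]
print(cross_situational(scenes))  # {'few'}
```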

So now I'm teaching you Sicilian, right? Or I'm teaching my robot to speak Sicilian, because I'm getting old and I go back to my original language, and I will want to order my cup of tea using Sicilian words.

So the first challenge that we have is: how do we let robots understand words, going beyond a simple amodal symbolic representation? First for concrete words, which is kind of easier, and there are many solutions; and then we move towards abstract words. The way we do it in our approach, called developmental, is that we model in the robot the learning steps of language learning. So let me show you an example from a study by Luca, during his Marie Curie PhD with me. This is an experiment on language learning by Chen Yu and Linda Smith. Let's look at the video collected during the experiment: a camera on the forehead of the child on the left, and on the forehead of the parent on the right.

I'm going to show you this video more than once.

This is what the child sees, and this is what the parent sees. Very different worlds, right? And when we look... now, I want to stop in a moment. Okay, I want to stop here on purpose. Can you see how very different what the parent sees is? The parent sees the child and three objects, right? The child sees one object. And not only that: the child has something called shape bias. So here I have one shape; let's call it the microphone. If the parent says "this is blah blah gavagai", then for me as a child this is very easy. We call this scaffolding: we are simplifying the learning task by helping the child to focus on the core target object, either when the child looks at it, or when the parent actually takes the object, moves it closer to the child's head, and then names it. So look at some experiments combining passive teaching of words and active teaching.

Let's look at the robot video in a moment. So, passive: "This is a bottle." Okay. More passive. You see what we see as teachers on the left, and what the robot sees. Now the robot sees one shape for "cup", because the parent is producing this scaffolding. So this shows (we, and many others, have done a lot of work on this) that it is relatively easy to teach the robot the meaning, the grounding, of concrete objects. But the real challenge is: how do you teach concepts which are abstract, like numbers, as an example? The way we do this is again by looking at child development: we know that we use embodied strategies like finger counting or gestures, and this is what we have been doing in the past.

This is also finger counting. On the left you see data from children in experiments; we really copy the setup of a child experiment and embed it into the strategy. There is a neural network behind this, and we are continuing to work on it.

I got two plus two.

Okay, the robot will do this. This is my current ERC Advanced Grant project, and I have three postdocs. Shang Wang is here, and he's working on more experiments on the embodiment of numbers, for example, plus other aspects of abstract words.

So challenges one and two are about learning words, which is what I do. Challenge three is a bit more theoretical, and practical from a wider machine learning point of view. Any use of a robot nowadays in HRI will imply AI, okay? It could be a VLA, it could be an LLM, it could be an old-fashioned random forest approach, or maybe symbolic AI. What matters is that we are going to use some kind of AI. So let's reflect on challenge three, which is the one that I hope will start the debate. This is about LLMs. Let's look at the positives first.

So there are huge opportunities. One first reflection is: I cannot believe, I, Angelo, cannot believe, that they are working this well, these LLMs. You know, no grammatical errors, and they produce very interesting things. For me, hallucination is not really an issue, in the sense that hallucination, which in neural networks we call generalization, is actually part of the feature set. And I'm showing here Tony Belpaeme, my friend Tony, we were colleagues in Plymouth, because he's an enthusiast about LLMs, robots, and dialogue. So, positives:

how can we use an LLM or a VLA? For dialogue, that's the first obvious answer. A chatbot, right? A ChatGPT-style chatbot, with vision and speech. We can optimize speech comprehension: what we can do now with Whisper and similar systems was unthinkable a few years ago in terms of robustness. Of course, nothing is perfect, but it's strong. Vision again: object recognition is quite easy.

I've had some discussions with a colleague on using LLMs for autobiographical memory. You can use RAG, this kind of search within a database, which could be photos of your previous experiments. You could save all the images of your past events and then do a RAG search over your past memory, or you can build a knowledge base from the data. For me, an LLM is not a language system; it is a knowledge base with a data-driven approach, because you have this representation of semantics, but semantics without understanding. This is important.

Then, if you have a VLA, a vision-language-action model, and you have a multimodal representation: is this grounding? Maybe we are moving towards grounding, but this is part of an open research question. But let's stop with the positives, because when people get old like me, we get grumpier, okay? We are never happy. Actually, I'm an optimist, but at the moment with LLMs I'm a bit worried, and you will see why.

So let's reflect on the issue of training and using an LLM: access to data. These are some numbers given to me by Roberto Navigli, the Italian NLP researcher who built the Italian-language LLM called Minerva. For an LLM you need trillions of tokens; a good reader reads less than a billion words in a lifetime, roughly 200,000 per week, in approximate numbers. These are figures from a paper in Trends in Cognitive Sciences. The scale on the vertical axis is of course nonlinear; you see the human region in terms of words of data versus GPT-3, which is by now the prehistory of transformers, you can imagine. So we are very, very far apart in terms of scale and complexity.
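Those orders of magnitude can be sanity-checked with simple arithmetic. A minimal sketch, using the talk's 200,000-words-per-week figure and assuming (my assumption, not a figure from the talk) an 80-year reading span:

```python
# Rough comparison of LLM training data vs. a lifetime of human reading.
# Figures from the talk: about 200,000 words/week for a good reader,
# trillions of tokens for LLM pre-training. The 80-year span is an assumption.
WORDS_PER_WEEK = 200_000
WEEKS_PER_YEAR = 52
READING_YEARS = 80  # assumed reading lifespan

lifetime_words = WORDS_PER_WEEK * WEEKS_PER_YEAR * READING_YEARS
llm_tokens = 10**12  # order of magnitude for modern LLM pre-training corpora

print(f"Lifetime reading: ~{lifetime_words:.1e} words")  # about 8.3e+08, under a billion
print(f"LLM pre-training: ~{llm_tokens:.0e} tokens")
print(f"Ratio: ~{llm_tokens / lifetime_words:.0f}x")     # roughly 1200x
```

Even a lifetime of heavy reading stays under a billion words, about three orders of magnitude below trillion-token pre-training corpora.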

So let's focus on children, because as you can see, most of the work I do is about children and bottom-up learning: "start small" is the subtitle. GPT-3 was trained on 5 x 10^11 tokens; more recent models reach 10^12, and this number grows continuously. A child hears on the order of 10^6 words per month, so by ten years of age you have accumulated perhaps 3 x 10^7, maybe up to 10^8. Again: non-comparable numbers. I always say a ten-year-old child is more competent than an LLM, not only in linguistic production, of course, but in real understanding of things.

And we are not even talking yet about structure. So from the data we move to the structure, the properties, of these models, these transformers. The long-term memory, LTM, is too strong, in the sense that you overlearn on the data. And the short-term memory, the working memory, is again too good, although it is mapped onto a sequence of activations. Remember, a transformer is not a recurrent network; it is a chain of transformer events. So from one point of view there is no working memory in a transformer; another way to see it is that by combining a chain of events, where you add the previous output of the system back into the prompt, you get some kind of sequential representation.
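The contrast between recurrent memory and a stateless transformer can be sketched as toy arithmetic. No learning is involved, and both "models" here are invented numeric stand-ins, not real architectures:

```python
# An Elman-style recurrent net carries a hidden state across time steps;
# a transformer is stateless and re-reads the whole sequence at every step.
# Toy numeric versions of each, just to show where the "memory" lives.

def rnn_step(x: float, h: float) -> float:
    """One recurrent step: the new state depends on the input AND the previous state."""
    return 0.5 * x + 0.5 * h

def transformer_step(tokens: list[float]) -> float:
    """Stateless: the output is recomputed from the full visible sequence."""
    return sum(tokens) / len(tokens)  # stand-in for attention over all tokens

seq = [1.0, 0.0, 1.0]
h = 0.0
for x in seq:
    h = rnn_step(x, h)           # memory lives in h, carried across steps
out = transformer_step(seq)      # "memory" is just the re-presented sequence
print(h, out)
```

The recurrent state `h` depends on the order of inputs; the stateless function only ever sees whatever sequence you hand it again, which is the prompt-chaining point above.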

So let me now reflect on my personal issues with LLMs, specifically token-only LLMs, not VLAs; I like VLAs more. Me as a cognitive scientist: for those who don't know, I studied psychology. I approach robotics as a way to understand cognition, of course also as a way to build artificial systems and robots, but for me above all as a way to understand cognition. So what are my issues? First, they shouldn't be called large language models. They should be called large text models, or large token models.

This is already an issue. There is no language there; there is text, and language is more than text, because language is semantics and pragmatics. Pragmatics is the understanding of meaning within the context, within the task and the goal that you want to achieve.

There is no understanding in an LLM. If I ask a question in Sicilian to an LLM which has been trained on the Sicilian dialect, I get a nice answer, but the system has no understanding; in the same way that, if I run the Chinese Room experiment with Sicilian, which is what we did, you, the operator, do not understand Sicilian. This is what Searle said: I, the operator, do not understand Chinese. So, no understanding. We can debate this; there are philosophical theories according to which there is understanding in the system. But for me, you have no understanding of Sicilian if you can answer me in Sicilian simply by using this mapping exercise, this symbol-manipulation exercise.

There is no embodiment. There is no body. Even if the system extracts statistical similarities involving body parts, there is really no understanding within the system. We can debate this later, and there is always somebody who disagrees with me, which is nice. That is the whole point of doing science.

GPT transformers are, for me as a cognitive scientist, the most non-cognitive of neural networks, because there is no time, no recurrency, and attention is unlimited. Remember "Attention Is All You Need", the paper that proposed the transformer for the first time. Attention is all you need? In human cognition, we would say selective attention is all you need.

When I read a bit of a book, I pay attention to one word and a few words before and after; I don't pay attention to thousands or millions of words. A transformer with a context size of, say, 10,000 or 100,000 tokens is paying attention to each individual token and to the relationship between each token and all the other 9,999 tokens, as if you were paying attention to everything at the same time. That is non-human; it is superhuman, in a negative sense. I believe there is no memory and no time; we discussed this before. And I showed you the data: the amount of data you need to pre-train a transformer is not compatible with the little data you need for other humanlike systems. We can debate this later.
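The unlimited-versus-selective attention point can be illustrated by counting pairwise comparisons. A minimal sketch; the window size of three is an arbitrary assumption standing in for the "few words before and after":

```python
# Full self-attention relates every token to every other token (n^2 pairs),
# while a human-like selective window looks only at a few neighbours.

def full_attention_pairs(n: int) -> int:
    """Each of n tokens attends to all n tokens."""
    return n * n

def windowed_attention_pairs(n: int, w: int) -> int:
    """Each token attends only to itself plus w neighbours on each side."""
    return n * min(n, 2 * w + 1)

n = 10_000                             # context size mentioned in the talk
print(full_attention_pairs(n))         # 100,000,000 comparisons
print(windowed_attention_pairs(n, 3))  # 70,000 comparisons
```

The full-attention cost grows quadratically with context length, while a fixed selective window grows only linearly, which is one way to read the "superhuman, in a negative sense" remark.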

Forget about AGI. You will never get AGI with a transformer. I know Elon Musk and friends, and some LLM companies, want your billions, so they will sell you the vision of AGI, because they need those billions. If you want to give them billions, that's fine: invest the money in these companies; I'm not giving you any financial advice. But scientifically speaking, don't think you will get AGI from that. Again, some of you will hopefully disagree with me.

So what's the solution? The solution is going beyond large data. What can we observe in human cognition? There is pre-existing structure: the retina is directly linked only to the occipital area, at the back of the brain, structure which has evolved through evolution. There are maturation mechanisms: memory capabilities evolve, the wiring of the brain changes, and the pruning of the brain happens only in specific critical periods. We are multimodal systems. We have pre-existing knowledge. We are social, interactive systems; that's how we learn and how we select the input. And we have, of course, a realistically limited short-term memory and a relatively large but not unlimited long-term memory in which we save events. So small, not large, data models are what we should be looking at from a cognitive point of view, so that our robots for HRI will be more humanlike, because in the end they need to interact with humans.

Right? This is the subtitle of my talk, which is actually the subtitle of a paper by Jeff Elman from 1993, "Learning and development in neural networks: The importance of starting small". I spent a year of my PhD, in 1994-95, in San Diego with Jeff Elman, so it's nice to go back to this. Why is it important? I'm going to tell you about this paper because it's a very toy model, but a toy model with a very impactful set of statements and methodologies.

First, he is the guy who used, I don't know if he was the first to use it, who invented the concept, next-word prediction: in the original 1990 paper, "Finding Structure in Time", there is a task on learning the next character of a word, and then learning the next word in a sentence. So word prediction is what he used, which is what transformers use too. He also selected a series of, I think, about 500 sentences with nested grammatical structure. "Cats chase dogs": very simple noun-verb-object. Then nested: "Boys who chase dogs see girls". "Boys see girls" is the main sentence and "who chase dogs" is the embedded clause: the boys who chase the dogs also see the girls. He took an arbitrary network, these are called Elman networks, or simple recurrent networks, and used three methods to train it.

If you show all these 500 sentences, with their different levels of complexity, call it the full data, this arbitrarily small network will not learn them, which is fair enough. Of course you can increase the size of your network, more layers, more units, and it will learn. But the whole point is: let's choose a network which cannot learn, and then let's do something else.

Let's go back to the data: first train the system on "cats chase dogs", "Mary feeds John", and after it learns these simple, non-nested sentences, you introduce one level of nestedness, then two, three, and so on: incremental on the data. In this case the network does learn. Although Elman says: actually, this is not really humanlike. We do use motherese, a simplified language, but at the same time we speak full humanlike language, and children are exposed to full human language from the beginning.

Another intuition, which is a good point here, is maturation in memory. What does that mean? In an Elman network you can reset the memory, the past context held in short-term memory, because it is a recurrent network. What Elman did in training is that for the first, say, two epochs, the memory can only hold two words: "boys chase". Because of the memory limit, you only pay attention to two words; then the memory grows over time, you have maturation of memory capabilities. Then you can remember three words, "boys chase girls", and then four, five. So the memory changes instead of the data changing, and if you do this, you again get good training.

The simple message here is that by changing the data structure, or through maturational mechanisms, you can get a small network to learn complex structures, which is what we want to do.
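Elman's two regimes can be sketched as training schedules. The sentences are from the talk, but the epoch counts and the window-growth rule are invented for illustration, and no actual simple recurrent network is trained here:

```python
# Sketch of Elman's (1993) two "starting small" regimes, written as training
# schedules rather than as an actual simple-recurrent-network implementation.
simple = ["cats chase dogs", "Mary feeds John"]
nested = ["boys who chase dogs see girls"]

# Regime 1: incremental data. Simple sentences first, then nested ones too.
def incremental_data_schedule(epochs: int):
    for epoch in range(epochs):
        corpus = simple if epoch < epochs // 2 else simple + nested
        yield epoch, corpus

# Regime 2: memory maturation. The full corpus from the start, but the
# network's usable context grows over epochs (memory reset after `window` words).
def maturation_schedule(epochs: int):
    for epoch in range(epochs):
        window = 2 + epoch  # memory capacity matures over time
        truncated = [" ".join(s.split()[:window]) for s in simple + nested]
        yield epoch, truncated

for epoch, corpus in maturation_schedule(3):
    print(epoch, corpus)
```

In the first regime the data gets harder while the learner stays fixed; in the second the data stays fixed while the learner's memory matures, which is the "memory changes instead of the data changing" point.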

Of course, in addition to this 1993 paper, in 1996 Elman co-authored with others the book "Rethinking Innateness", which was all about changing our vision of innateness. So it's not nature versus nurture; it's more complex than that, because of maturation. And this is really aligned with my presentation: if you have seen my talks at other times, developmental robotics, where many of us work, has actually been proposing the same approach for years, with a different emphasis, here more on the embodied linguistic side.

So let me conclude, because we should have time for a discussion here, so I'm going to skip this, with the three challenges and potential discussion points; you can of course also ask your own questions.

Challenge one: embodied concrete words. How do we represent sensory-motor and internal embodiment, in particular internal embodiment: how do I represent emotions, for example, internally, not only recognize them from the face of a person or a robot? And then: is there understanding in LLMs? Do you agree with me, or do some of you think there actually is structure and understanding in LLMs?

Challenge two: abstract words. There is semantics from statistics; some theories say you get abstract words from the relations between words, other people say you need embodiment. What's your point of view? One theory of abstract words, by Gabriella Vigliocco at UCL in London, is that if you represent, for example, social relationships, father, friend, peer, these are non-concrete concepts that you can then use to generalize to more abstract concepts. But how do I represent such a social relationship within, say, the state of a neural network?

And challenge three: there is no way back from LLMs. I'm not saying don't use LLMs; I'm saying use LLMs where appropriate, stating the assumptions that you need to make. There is no way back. They are too powerful, and we will be using them in real-world applications, and in robotics too. But remember the limitations and the assumptions you have to make when you look at this. And of course, ideally, we need people who look at the future, at post-LLM, post-transformer models which allow selective attention, short-term memory, maturation capabilities, and pre-wiring of your systems, to shape the future of AI and the use of AI in human-robot interaction. I think I can stop here by giving control back to Patrick and Claire.

Thank you Angelo for a fascinating talk.

Lots to think about there. Um so we've got some time for some questions now. Um

we haven't had any questions come in on the Q&A yet. So um please don't be shy.

Um please do type them into the Q&A or you can raise your hand um if you'd like to ask a question. And I can see one raised hand already. Um so I will uh ask

you to unmute and you can ask your question.

There is one from Patrick I guess. Is

this a question Patrick?

Yeah yeah I added a question as well.

Yes.

So, you have somewhat hinted at this already, but why do you think machines are so bad at understanding human relationships? If you refer to your last point, you know, how do we map this?

I think it's not that the machines are limited. I still cannot think of a way to represent it. You know my old point: I've been a neural network person for 30 years; we used to call ourselves connectionists, PDP, parallel distributed processing. The idea is that within a neural network unit or layer you have a concept which is distributed, not symbolically defined. It's very easy in a symbolic network to have a node for "father", a node for "mother", a node for "parent", a node for "friend", and so on. But how do you let a distributed concept emerge? How do you find the data to show a network, and therefore a machine, a robot with this machinery behind it, the concept of father, mother, and so on? This is why I think it's difficult. Not that machines cannot do it; it's that it's hard for us to think of a way to do it, maybe.

Thanks. And in interaction, when the robot is interacting with, say, different family members, that seems to be challenging as well.

Yeah, yes, exactly. You know, even just friends, trustworthy and untrustworthy friends, it's a challenge. I still don't know how to do this other than cheating and having an artificial layer or embedding where you say "this is parent and child, this is mother and father", but I don't like this.

Thank you. Um, Zong, would you like to ask your question?

Uh, yeah, can I speak now? Can you hear me? Yes. Thank you very much to the host, and thanks for the talk, Angelo. I know you mentioned that an LLM is not a language model, it's a kind of text model, even though we can add embodiment information to let the LLM learn later: after a stage of pre-trained language, we give actions and a body to the LLM and then train again. You still don't think it's language, right?

Yes. For me, an LLM is not language; it's text. If there is no grounded sensory-motor representation, it stays there. Let me say this: you can take an LLM representation, an intermediate layer, do your clustering of the representations, and you will get what Elman actually did in 1990: you can see that animals and human beings share similarities. This is exactly what our embeddings do, right? The system will extract regularities from the embeddings. But this is not because the system has experienced animate versus inanimate things; it's just because the words "dog" and "person" and "walk" and "run" occur together more often than other concepts. So it's in the structure of the data, the tokens, not in the language. Because for me, language is pragmatics: it's what I do with language in a sensory-motor world. I know it's a bit of an extreme position, but that's what I believe.
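The point that such similarities come from distributional statistics alone can be sketched with toy co-occurrence vectors. The counts below are invented for illustration, not real corpus data:

```python
# With purely distributional vectors, "dog" and "person" end up closer to each
# other than to "rock", with no experience of animals needed: the structure is
# in the data. Hypothetical co-occurrence counts with the context words
# [runs, eats, hard, grey].
import math

vectors = {
    "dog":    [9, 8, 1, 1],
    "person": [8, 9, 1, 2],
    "rock":   [1, 0, 9, 8],
}

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine(vectors["dog"], vectors["person"]))  # high: the "animate" pair
print(cosine(vectors["dog"], vectors["rock"]))    # low
```

The clustering falls out of which context words co-occur with which targets, exactly the kind of regularity an embedding layer extracts, with no grounding involved.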

Yeah. But I still want to say: maybe the machines, the CPUs and GPUs, create their own language, and we call it a large language model, while we humans have our own communication. So I wanted to know whether you think the two systems are so different.

Okay, I like this proposal, in the sense that I've been proposing something related.

When I started my PhD in 1995, there was a field, I was part of this community, called artificial life, where the idea was that you built animal-like agents, more than humans, although it could have been humans. Langton, Chris Langton, for those of you who know him, and the other early artificial life people had this idea: let's not simulate dog cognition, let's not simulate bird cognition; let's create new ways to fly, new ways to form a flock. The flocking paradigm for bird flight doesn't have to be birdlike. Why? Because you create an artificial way to do the same things, and then you try to analyze how this machinery builds this knowledge.

So for me, an LLM is an artificial life approach to text mapping. In this sense I accept that we should use LLMs to understand how an artificial system can be so powerful; this is what I said, it is so powerful, and I really want to understand how it can be so powerful. And this will give us insights into how human cognition is powerful. But remember: on this planet we evolved humanlike cognition; there might be another planet somewhere where they use different ways to communicate, which could be similar to what an LLM does, which is purely text-based. So in this sense, yes, but more as a speculation tool than as a humanlike tool. Thank you.

Thank you very much. Thank you.

Thank you. Um, brilliant. We've got a few more questions that have come in. So Zchen said: "I strongly agree that current LLMs are not language models in the truest sense, but I would like to inquire whether it is possible to achieve mapping by simultaneously learning tactile sensations and textual information within a shared feature space. Specifically, to what extent would the integration of haptic feedback and linguistic information influence the emergence of cognition?"

Very good point, and I agree with this. So definitely yes, and it doesn't have to be only tactile. It could be visual and tactile; it could be, I don't know the word in English, gustatory; it could be the mouth, the foot sensing; it could be anything. What matters is that whenever you train one multimodal system, these modalities are represented together. You don't have a visual system and a tactile system which are separate from the LLM, text-only, token-only, which is separate from the action. They should all be treated together. You should learn latent representations which are inherently multimodal. So yes, I agree.

Of course, tactile is very important, because we keep talking about vision. My talk, you know, focused on vision, but tactile is essential in human cognition. Theories of human intelligence say that we humans are, let's say, more intelligent, or at a different, maybe higher, level of intelligence than chimpanzees, dogs, or worms, because we have hands and we have very fine tactile manipulation skills. There are also theories of language evolution saying that language evolved in humans and not in other species because we walk upright, we have our hands, we can use and make tools, and tactile sensing is essential to this. So I strongly support research on multimodality with tactile sensing.

Fantastic. Thanks for the great answer.

Um, so Marco asked; firstly, he says thank you very much for your talk, and he asks you to develop more the claim that we cannot achieve AGI

with transformers.

Okay. So, I'm a psychologist, so I studied the definition of intelligence, and the standard view is that intelligence is multi-component, or multifactorial. When we do an IQ test, we do memory tests, recalling n items or chunks of items, numbers or words; then we look at working memory; then we do sound-related tests, and so on. Intelligence is made of many components. So if we mean intelligence in the psychology sense, multicomponent, can a transformer achieve this?

If by intelligence, by AGI, you mean that in one skill a machine is more powerful than humans, we already have this: the calculator, invented many decades ago, is a superintelligent system for doing mathematical calculations. So we already have "AGI" for specific individual skills. But AGI in the real sense, that I have many capabilities that work together, cannot be there in a transformer, because of the nature of the transformer. Again, please: when we do a summary with a transformer, with a GPT, very good, they work well, we know they are not precise, but you are scientists, right? Don't forget that the task of a transformer is: what's the next word? So suppose you ask a transformer a very simple question.

"How do I make a cup of tea?" The system will answer: "You take a cup." Let's pretend this is the answer: four words to answer the question "How do I make a cup of tea?" Or let's assume the correct answer would be "You need a teapot." But if you do single-word prediction, then when I ask the question "How do I make a cup of tea?", the answer is "You". Full stop. Then you reuse the "You": you add it to your input. So now the input is "How do I make a cup of tea? You", and the system will continue. Let's not forget that this is what a transformer does: a transformer predicts the next word. So, "How do I make a cup of tea?" The answer is "You", not "You need a teapot". Don't forget this. So how can you get language from a system that works this way?

Yeah, it's a great point. Um, brilliant. We had a question over on YouTube; I want to make sure that the YouTube audience gets involved as well. So Julian says: "What would real machine understanding mean? If a VLA puts the embedding for the word 'pushing' close to that of a video of someone pushing, is that closer to how humans model such things?"

If you give the system multimodality, then you start to link words, tokens, with experiences: an image, a tactile pattern. So in this sense, yes, we move towards this if you have multimodality.

Now let me also add something. You can have different levels of understanding, in the philosophical definition. One is that within your own system, this is what Stevan Harnad said in the symbol grounding problem, within your own cognitive system, without relying on an interpreter, you really do the mapping between images, internal sensing, and words. That's the first level of understanding. Can an LLM, which only has words, understand? A VLA maybe yes, because you are combining these concepts together. So the first level is this multimodal linking of components of meaning. The second or third level is being aware of your own understanding, which moves towards awareness and consciousness, which is not what I'm referring to in this case. That's interesting, of course, but here by understanding I really mean that the system itself can do the mapping between words and experience.

So if you go back to my slide and answer the question, you would be a parrot spitting out words taken from this mapping. There is no understanding for you of the story, which for me, because I created a story in my Sicilian artificial world, is meaningful. There is no meaning for you. So when you speak Sicilian to me, unless you come to Sicily and learn the words with your own experience, you are not understanding Sicilian, although you are speaking Sicilian and I think, "oh, very good". This is what the Chinese lady thinks, right? "Oh, I'm getting the answer about the capital of China."

Yeah. Um, we are over time. Um, there's

one really interesting question that if you have a few more minutes, Angelo, I'd like to ask you. Um, I'm sorry that we're not going to have time to get to all the questions, but I thought this one was particularly interesting. Um, so

um, this was submitted on the Q&A, and the question is: "Do we need LLMs to understand? Could a robot still make you a cup of tea even if it doesn't know what any of it means?"

Um, if you find a solution, well, it's like the artificial world; remember artificial life. If we need a technical solution to do this, it's possible, and we don't have to care about whether it is a transformer or not. But I said at the beginning that my approach is to use these systems to understand cognition, and also, these systems have to interact with humans. So I would expect explainability; this is explainable AI, right? If you give me a full black box where I cannot really interrogate it or go inside the mind of this artificial robot, I'm not sure. Remember, for those familiar with the field, there is a link between trust and theory of mind, you know, an explanation: you trust the robot if you can get an explanation from the robot about its own decision making. If the robot is using humanlike mechanisms, it can explain them to you; if the robot is using its own artificial world, it cannot give you an explanation, and you will not trust the robot. So there are benefits in using human-inspired cognitive approaches.

An aircraft doesn't have to, you know, flap its wings to fly. So I might be happy with an aircraft that is not fully related to, let's say, animal cognition. But there are levels, components, which are kind of important, especially because we are talking here about our community, human-robot interaction. These robots are our tools; they are there to help us, and we need to have full control of these tools, also from an ethical point of view. Please, when I get old, don't give me a robot which I cannot, you know, interrogate, because I want to know what I'm using.

Yeah. Yeah. Absolutely. That's such an interesting point. Um, I'm really sorry, but that's all the time we have for today's seminar. Thank you so much to everyone who submitted questions, and I'm really sorry we haven't had time to get to them all. I'd like to thank Professor Cangelosi for sharing his insights, and thank you to the wonderful audience on Zoom and YouTube for contributing to a really, really interesting discussion.

So this is part of an ongoing series of seminars organized by the UK-HRI topic group on methods and best practices in human-robot interaction research. The next seminar will be announced soon on our website, and I've just added links to the chat so that you can keep an eye on the website for announcements. Follow the group on LinkedIn, and if you want to get more involved, there's a Discord community that you can join. Thank you again to everyone for joining us, and I hope to see you again at our next seminar.

Thank you very much, Claire. Thanks, Angelo.

Byebye. Thanks everyone.

Thank you. Thanks for the invitation.
