
Language Understanding and LLMs with Christopher Manning - 686

By The TWIML AI Podcast with Sam Charrington

Summary

Topics Covered

  • Linguists Were Right About Structure Discovery
  • Pattern Matching Masquerades as Reasoning
  • Foundation Models: A Fundamentally Different Paradigm
  • AI's Diversity Crisis: Thousands of Languages Left Behind
  • The Real Frontier Lives Behind Language

Full Transcript

Looking behind the veil of language, and these questions of reasoning and intelligence and how humans store their knowledge of the world... We have such good language understanding and generation that a lot of what we're missing is then the stuff behind that, and that's going to make all of the difference in giving us intelligent machines, but also intelligent language users.

[Music] All right everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm

excited to be joined by Christopher Manning. Chris is a professor of machine learning in the departments of linguistics and computer science at Stanford University, director of the Stanford AI Laboratory, founder of the Stanford NLP Group, and an associate director of the Stanford Institute for Human-Centered Artificial Intelligence, or HAI. Chris is also the 2024 IEEE John von Neumann Medal recipient, which recognized him for advances in computational representation and analysis of natural language. Chris, congrats on the recent award, and welcome to the podcast. Thanks a lot Sam, and it's great to be on the show with you. I am looking forward to digging into our

conversation; we will have no shortage of things to talk about. You have been involved in a lot of the foundational research that has set the stage for LLMs and generative AI, and in fact I think I want to start us off by zooming out and talking about just that. You know, we can circle back to some of these things individually, but you worked on the GloVe paper, which kind of laid the foundation for us to better understand and use word embeddings; you worked on some of the early applications of attention to NLP; your work on joint text and image contrastive learning kind of set the stage for multimodal ML models. I'm really curious to hear: has it surprised you, the way all of this has come together to create tools as powerful as LLMs and multimodal models, or at some point, you know, was it clear what would become possible and it was just a matter of putting the right pieces in place?

I think it's absolutely surprising, I think for anyone who's been working in the field for more than a decade; I mean, I guess for me it's about 30 years. I mean, it's just astounding, this upward trajectory that we've had. Things started to take off about a decade ago, but really in the last five years things have just zoomed ahead. And well, you know, you can trace back the predecessors; there's sort of a path that, once things started happening with large language models, you could see that there were possibilities from scaling further. But nevertheless, just the way things have emerged with these abilities of large language models, it's just really very surprising, and it came from an unexpected direction. And to what degree did your background as a linguist impact, positively or negatively, your ability to see and contribute in so

many ways? Well, in general I think me being a linguist, in a field where, I don't know, 97% of people aren't linguists, they're machine learning people or sometimes physicists or other more general backgrounds, that's given me a really distinctive viewpoint, and it has allowed me to contribute in complementary ways in showing the role of language. But on the other hand, linguistic knowledge is a good broad background; I mean, in terms of what's led to these extraordinary advances, there's no doubt at all that the main engine has been the machine learning and the math. It hasn't really been coming straight from linguistics. Mhm. I'm imagining that for you as a linguist there is a degree, or has been a degree, of

tension, maybe, you know, we're long past that, but at some point there was a degree of tension in kind of not wanting to let go of traditional linguistic approaches and needing to embrace statistics. I'm curious how that appeared for you, and to what degree you had to grapple with that, what that really meant. I mean,

it's a great question, and it's absolutely a live issue, very much so to this day. So academic linguistics, especially in the United States but in general worldwide, in the second half of the 20th century leading into the 21st century, the dominant figure was Noam Chomsky, and Noam is really getting on at this point, but you know, he's still active, and in his recent pieces about language, Noam Chomsky is loudly proclaiming that large language models are of no interest to linguistics and tell us nothing about human language, and so do many of the Chomskyan linguists who actually dominate linguistics departments, certainly in the United States. You know, that has been the picture; that

was never my picture. So although I definitely started off as a linguist, you know, how my path was shaped was: I thought there had to be a lot of learning involved in human language acquisition, and I was interested in how that could happen, and that led me to start then looking at machine learning, and things built from there. And so there are other more empirically minded linguists who see really good points of connection between large language models and what happens in linguistics. I shouldn't go too deep into a riff on the history of linguistics, but there's actually an interesting history here, because,

you know, way back in the sort of '30s, '40s, and '50s, the dominant strand of linguistics was the American structuralists, and really what they believed in was that if you had a body of texts in a language, and they were thinking of something like a Native American language where they'd collect stories and so on, what you should be trying to do is define an inductive procedure so you could learn the structure of the language from the texts. Now, in the 1940s and 1950s basically nothing was known about machine learning or other related fields like information theory, right, Shannon only invented information theory in 1948, so they didn't actually make much progress in this goal. But their goal was to be able to come up with a linguistic structure discovery procedure, where you could look through the texts of some language and start to learn the structure of sentences. And really the first thing that made Chomsky famous was trying to argue that no such linguistic structure discovery procedure was possible. So, despite the fact that they didn't have much in the way of methods, I really think of it as: no, these people before Chomsky were right, that is the right kind of goal, and we're now actually seeing it happen in the age of large language models. And so you mentioned that the

Chomskyan viewpoint is that LLMs don't have anything to teach us about language; in what ways do you see LLMs teaching us about language? Yeah, so I mean, it is important to point out that human language acquisition, a baby starting to learn language, is obviously nothing like training a large language model, so you know there's a lot of difference there, and I don't want to in any way deny it. But the part where it's interesting is that Chomskyan linguistics has been founded on the notion that you can't possibly learn the structure of a human language from the observed evidence, and that leads into his theories of the innateness of human language, and so there's very little learning involved. Whereas precisely what large language models show is that actually you can learn the structure of a human language. And so this is something that I worked on once large language models started coming to the fore around 2018-2020; you know, with my

linguist hat on, I was not interested just in the fact that these large language models could generate sentences; I was interested in, well, what do they know about the structure of English, or the structure of French, or the structure of Chinese? And actually you can poke around in the representations of these models and find out, oh yeah, they know about subjects and objects and predicates and relative clauses, and you can see that they actually have learned to decode the structure of English sentences, and that's part of why they're so good at generating fluent English, or whatever language, as we sort of see in the output of ChatGPT or similar models. So do you see LLMs as somewhat of an existence proof that this learning does happen? And it sounds like the Chomskyans still resist this; like, what do they say? Right, yeah, so it's an existence proof that you can learn the structure of human languages, not necessarily that we do, yeah, from a ton of data. So the kind of arguments in the reverse direction that would be made by Chomsky or similar

people: well, firstly, the amount of data that large language models use is completely unrealistic for human language acquisition. So humans become good language learners on somewhere around 50 or 100 million words of data, whereas large language models are being trained on a minimum of billions of words of data, and the very largest language models now are being trained on a trillion or more words of data, right, so there's a huge order-of-magnitude difference there. But perhaps more profoundly, what Chomsky would like to argue is that these models are much too general, that large language models or similar neural network models kind of suck the structure out of anything, and so they can as easily learn patterns in DNA or patterns in chemical formulas as they can learn human languages. And Chomsky thinks that's profoundly wrong, because there's a lot of commonality in the structure of human languages, and so he's always wanted to seek a very sort of restrictive model which can describe only what's found in human languages. Interesting, interesting. I've always found it super interesting, the

kind of two-way street between the biological side, or the human side, and the computer science side, in particular the neuroscience side: neural networks are, quote unquote, inspired by neuroscience, and that kind of pushes the CS side forward, but then the neuroscience folks take what's done on the computer science side, run with it, and push our understanding of the neuroscience forward. And it sounds like there's an opportunity here from a language perspective as well. We've talked a little bit about Chomsky having his opinion, or Chomskyans having their opinion, on the place of an LLM; like, what research has to happen from a linguistics perspective going forward, to build on this existence of LLMs and learn more, or teach us more, about language from a human perspective? So to start learning more from a human perspective we need to

start exploring models that are much closer to the human context of language acquisition. But I mean, yeah, I think there is this really useful thing of two sides that can feed off each other, that you were just talking about. Because if you want something closer to human language acquisition, well, human language acquisition is situated: you're in an environment, with things in the environment that are being talked about, and so, well, you need a multimodal foundation model. Well, multimodal foundation models are just the kind of thing that people are starting to work on now, but even then, most of those models, you're saying, here's a big pile of images and text about them, where it's very clear that that's not equal to human language acquisition, which is actually interactional: you're in an environment and there are people talking to each other, pointing at things, talking about things, looking at each other, and that interaction is highly key to human language acquisition. Right, there have been some sort of unfortunate natural experiments, where people who can't afford child care think that maybe if they just leave the TV on all day at home, their kid will learn language the same way as a person with a caregiver, and that's just not the case, right; the interaction is all

important. Which brings to mind other areas of contemporary research around situated AI, embodied AI, themes along those lines; it sounds like you see those as being critical? Yeah, I think they're super important areas to start investigating, to kind of get more connection and relevance from the human learning context. Yeah. Did you come at your research and your career broadly trying to create intelligence, or were your steps and interests different, or more discrete, or something else? Like, I'm curious about this thread of intelligence through your research: is it very directed, or organic, or how do you think about intelligence as a goal? Yeah, I mean, I think it emerged; it's definitely not where I began. Where I began was: human language seems

fascinating; humans can do amazing things in learning and understanding each other and speaking language; how could we use computers as a way to understand and model that? So I was sort of very focused on language, and really that's been the bulk of my career. So it's really as time went on, and especially in the neural networks era, when there started to be these very general methods, neural networks, where the same kind of models and methods can be applied to vision, robotics, language, that I slipped into being an artificial intelligence researcher. And then it's with the enormous success of large language models, leading to things like ChatGPT and other models, obviously Claude, Gemini, etc., that everyone is much more concerned about what is intelligence and whether these things are intelligent. So I sort of have slipped into being concerned with intelligence, but it wasn't really what my long-term goal was originally. Yeah, that's what I was really curious about: the degree to which that is a

primary concern and motivator for you, or is it ancillary to other things? And given that you've kind of slipped into it, and where we are now, what do you see as the relationship between LLMs and intelligence? Do you see that we've created some degree of scaled-down intelligence, do you see it as a small stepping stone, do you see it as a dead end relative to what people really want in terms of intelligence? Like, how do you make sense of LLMs and current AI in the context of intelligence? Right, I think it's definitely an important stepping stone, and a quite dramatic development which has given us something

very different. So let's say the positive part first, right. So in general in AI there's this long-standing distinction between narrow intelligence and general intelligence, and everything that was done in machine learning and AI prior to large language models could only be described as building artificial narrow intelligence, because typically what you were doing was you wanted to build a system, whether it was something to recommend movies, or to recognize birds in photos, or to say whether a piece of text was toxic; whatever you were doing, you collected data for that task, you trained your model, you had this thing that could make a decision in some domain, but it had no intelligence whatsoever beyond that. Whereas the goal that people had dreamt of for the entire history of AI was to have an artificial general intelligence, like a human being, that can do all sorts of things. And so in precisely that sense of the definition, I think we've got one, right: in the age of large language models like ChatGPT we now have this fairly general intelligence. You can ask it to write a poem, you can ask it to translate this piece of text into Chinese, you can ask it to summarize this long boring report, get recipe hints or hints on where to go visit in Vienna, right. It's a very general intelligence. At this point I have to go onto the caution side, which is, people use the term

artificial general intelligence these days in two senses. One is that sort of historical, technical sense, but in common usage it nowadays more often means this is something that's so intelligent it's as good as humans, or better than humans, in most respects, and at that point I think we need to be really cautious, and I think a lot of people are fooled as to how much intelligence there is in these large language models. You know, they can do absolutely incredible things in producing beautiful text, and I realize that, but I sometimes use the analogy that a large language model isn't really much more intelligent than a talking encyclopedia, because most of what we're impressed by with large language models is how much stuff they know. And I think historically we've tended to value that in human beings as well, you know, this person is really knowledgeable, but I don't actually think that's the heart of human intelligence. The heart of human intelligence is being able to adapt to new situations, to very quickly learn new things; that's the real intelligence, and that's not what large language models are doing. Large language models look amazing because they've slurped up all of this knowledge that has been hard won by humans over centuries, and it's all been shoved into the machine. They're not sort of quickly picking up and learning new things going about the world in the same way that human beings do. At the same time we've seen the

quote unquote emergence of properties like the ability to reason in sufficiently large LLMs. How do you think about that? And then there's a desire to extend those capabilities into kind of agentic systems and workflows that combine the knowledge and the ability to reason, to create something that can go off and do things in a more intelligent way. Like, how do you parse all of that, and the limitations that we'll end up finding there? Yeah, I actually don't think we should say at the moment that large language models

can reason; they can behave in ways where it looks like they can reason. So, I mean, and again this comes from them having just huge knowledge from all of the billions or trillions of words that they've read. So to the extent that there is a pattern of reasoning that's commonly available and they can mimic it, they can produce answers that look like they're reasoning through something. But it's also obvious in many cases, when they make sort of glaring mistakes, that they're not actually reasoning; they're sort of just pattern matching, and if there's a pattern of a sequence of steps and they've seen examples of it, they'll apply it to a situation and put in the terms you've used, and sometimes it works and sometimes it's glaringly wrong, and they don't really know the difference. And certainly if you give a large language model clear problems, like planning problems, where you have to sort of plan with some constraints as to what day someone's available, and how far they have to travel, and how you're going to get things to move around, various people have looked at planning problems, and the current large language models just can't do that kind of planning; they just can't sort of reason with those kinds of constraints. So yeah, I

mean, although there are lots of claims of large language models reasoning, I think at the moment we should be cautious. But on the other hand, there are other places, such as in playing chess and Go, where people have hooked up neural nets with search procedures, where they really do sort of plan and evaluate positions and so on. I do think in the coming decade it's quite likely that some of that kind of search and planning technology will be hooked up to large language models, and we really might start to see machines that can reason. And, as you mentioned, for tools and agents, once you're sort of connecting those in, that'll give language models new powers to calculate and perhaps plan things using these external tools. So I think we're on the cusp of some of those things becoming possible, but there's still more to do. I laughed a bit as you started responding, because I think that view of reasoning is the one that I tend

to gravitate to, and I've made very similar arguments, talked a little bit about needing to distinguish reasoning as a mechanism versus reasoning as kind of an outcome or an observable behavior. And I've talked about this in lots of interviews; I think what I tend to hear back most is that we really have very little understanding of reasoning as a mechanism in humans, so how can that be the bar? And all we can do is observe this thing and see if it looks like reasoning, and from that perspective LLMs seem to be doing some of it to some degree. Any reaction to that? You know, they seem to be doing it to some degree, and you can show examples where it absolutely looks like the language model can reason, because you'll show it some pattern, you'll ask it some question about, say, how many kilograms of stones can 10 people carry if one person can carry 20 kilograms, and it will say: if one person can carry 20 kilograms and there are 10 people, we should multiply 20 kilograms by 10 people, and therefore they can carry 200 kilograms. And you think to yourself, oh yes, this totally looks like reasoning: it's understood the problem, it's done the calculation, it's explained it, it is wonderful. Then you ask some different question where it should be

obvious what the answer is, and you'll come up with some variant problem, which might be: if groups of three people can carry a 60 kilogram box and you have nine people, how many kilograms can they carry? And it'll go through the same formula and say, okay, you've now got nine people and the boxes weigh 60 kilograms, therefore the nine people can carry 540 kilograms. And it's just, oh no, that's not right at all; it's pattern matched the same kind of framework, but it just did not pay attention at all to what was there.
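To spell out why 540 is a blooper: the 60 kilograms belong to a group of three, so the per-person capacity has to be divided out first. A minimal sketch of the correct arithmetic, using only the numbers from the example above (the helper name is just illustrative):

```python
def carrying_capacity(group_size, group_load_kg, num_people):
    # 3 people carry a 60 kg box => 20 kg per person.
    per_person = group_load_kg / group_size
    return per_person * num_people

print(carrying_capacity(3, 60, 9))  # 180.0 -- the right answer
print(60 * 9)                       # 540  -- the pattern-matched blooper:
                                    # 60 kg is per group, not per person
```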

And I think that is the current state of things. So people who want to claim they can reason can show examples that look just like it can reason, and people who want to naysay can rightly show bloopers like that. And I think the bloopers do show that this is more kind of pattern matching of reasoning patterns, and there isn't sort of the good understanding of situations that an awake human successfully has. You're saying something like: sufficiently advanced pattern matching is indistinguishable from reasoning, and yet

not reasoning? Yeah, that's sort of it, right. I mean, on the one hand, pattern recognition, which was once sort of dissed as a trivial form of doing things, has really become the centerpiece of AI. We've shown that by taking what are essentially pattern recognition technologies, which were pioneered for things like speech recognition and object recognition in vision, you can really push what you can do with pattern recognition technologies, and they've become our most powerful tools in artificial intelligence. You know, a large language model is very much using the approaches of pattern recognition. But nevertheless, I think a lot of people, the wise people like myself, think that to actually get to some of this sort of higher-level cognition of planning and reasoning, we do still need to have new breakthroughs and approaches that we don't yet have. Mm, and we don't have them;

is it clear, well, clear is probably a stretch, but do you have a sense for what stones we need to turn over to hope to find them? Not a clear one, yeah. So I mean, I think the way research tends to work, right, is things are sort of stuck somewhere, and then people find some really promising way that things can be pushed forward, and you climb up those stairs, and then it tends to flatten out until someone comes up with a new good idea. So I think one of the things that people feel like we really need is much more in the way of a world model, so that there's sort of a deeper understanding of the world. And to some extent large language models seem to create a world model, but it's this really weird one, since it's sort of this representation above the tokens in a sentence or paragraph or whatever it is in the language model; it doesn't seem like it's a very actionable world model. Language models just sort of generate forward, and well, people have done some tricks with chain-of-thought reasoning and things like that, but it seems like really, for thinking, you have to be able to explore outwards and go forwards and backwards and search. And so to sort of work out ways to do that kind of exploration and search and thinking, we need something like that, and some people are working on that now. So there are ideas and directions I think promising, but there's a difference between having good ideas as to roughly what you need and finding a really productive way to push things forward, as large language models have been. Mhm. Very early on in the conversation I

mentioned some of the key research innovations that you've contributed; one of those is GloVe and kind of our understanding of embeddings, and I'm wondering if you can riff a little bit on the way that you think about embeddings and the relationship between words and a vector space, both in and of itself and the way that that element has contributed to some of the pieces that we've built on top of it. Yeah, so

word embeddings are these vector representations of words, so real numbers in a big vector, at least hundreds of dimensions, maybe thousands. On the surface that seems a really weird way to represent the meaning of a word; it's certainly not what people have thought about for the rest of human history, but it proved to be a very successful way of capturing the meaning of a word and its relationships to other words, far more successful than the methods that had

preceded it. And it was also one of the first highly successful cases of doing this unsupervised, or self-supervised, learning, where you're just getting a huge amount of text and saying, okay, relatively simple neural network, cogitate on the relationships of words with other words in the context. And by then doing this optimization process, you got word vectors for the meanings of words which just do a really good job at capturing the semantics of words. You know, early on in my NLP class I demonstrate GloVe word vectors in the space, and it sort of actually still seems to me kind of incredible how well something that in retrospect is relatively simple manages to capture word meaning.
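For a flavor of what those classroom demos look like, here's a minimal sketch of probing GloVe vectors with plain numpy. It assumes you've downloaded a pretrained file such as glove.6B.100d.txt, whose standard distribution format is one word per line followed by its numbers; the specific words and the analogy are just the classic illustrations, not anything from this conversation:

```python
import numpy as np

# Load pretrained GloVe vectors: each line is a word then its floats.
vectors = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *nums = line.split()
        vectors[word] = np.array(nums, dtype=np.float32)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Related words end up close together in the vector space...
print(cosine(vectors["frog"], vectors["toad"]))    # relatively high
print(cosine(vectors["frog"], vectors["banana"]))  # much lower

# ...and some relationships come out as vector offsets:
# vector("king") - vector("man") + vector("woman") lands near "queen".
query = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # typically "queen" (brute-force search; slow but simple)
```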

So word vectors were really the key tool that led to the takeoff of neural network methods in natural language processing in the 2010s. But they also have limitations, and although there's still lots of use of word vectors in all sorts of places, for the sort of heart of NLP

we've kind of moved beyond them now, because with the modern Transformer large language models, the key difference is that for word vectors we were learning one vector for a particular word, regardless of the context in which it was used, and in reality most words have lots of different meanings depending on the context in which they're used. So "star" means something very different in a discussion of Hollywood than it does in a discussion of astronomy. Now, the surprising thing is word vectors can sort of cope with that: they kind of make these sort of disjunctive vectors that put all the different senses together. But when we're in a particular context, we want to know more about how to interpret the word in that context, and we can do that with the kind of contextual representations that we now compute for words in one of these large language models. In terms of those contextual representations, do you see,

I'm trying to bring together the important role that embeddings still have in practical use of generative AI systems, like retrieval-augmented systems; like, they're still front-ended by vector search and embeddings, and yet you're saying we're past embeddings in NLP. Like, yeah, reconcile that for me. Okay, so it's a mix. I mean, if you're wanting to do things with particular words and their meanings, looking for, say, gender bias in language, or doing vector retrieval, there are lots of good uses for word embeddings, and they're still very widely used. But for the center of where the big advances are happening in NLP, which is with large language models, well, sometimes people, in starting off, do initialize their Transformer with word

vector representations for the words, but in principle you don't have to: you can just start training on enough text and build a huge Transformer LLM, and it does learn word vectors. So at the very bottom of a Transformer, every token does have a word vector, but the meaning that you're using in your application isn't what you have at the bottom of the Transformer, it's what you have at the top of the Transformer, and that's a context-specific representation of a word. And so then if you're wanting to do things like matching pieces of text for their similarity, you're generally using those top-of-the-Transformer representations to say whether the meanings of pieces of text are similar to each other. So it's a mixture; they're still widely used.
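As a concrete sketch of using top-of-the-Transformer representations for similarity, here's a minimal example with the Hugging Face transformers library; the model choice, the sentences, and the mean pooling are all illustrative assumptions, not a specific setup Chris describes:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Top-layer (contextual) token vectors, mean-pooled into one vector.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)

a = embed("The star signed a three-movie deal in Hollywood.")
b = embed("The astronomers measured the star's brightness.")
c = embed("The actress became a famous celebrity.")

cos = torch.nn.functional.cosine_similarity
# The same word "star" is represented differently in each context,
# so the Hollywood sentence should land closer to the celebrity one.
print(cos(a, b, dim=0), cos(a, c, dim=0))
```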

But if you're thinking about, gee, what makes ChatGPT great, a lot of the game has moved somewhere else. Sure, sure. But I'm also wondering, are you hinting at a

future where the still widely used applications of embeddings give way to pure Transformer architectures, like for retrieval, for example? Is embedding just kind of natural in the way that we'll probably always do these kinds of things, or is there some analogous pure Transformer architecture for retrieval that you're foreseeing? I mean, a lot of these are technical questions of sort of how many resources you want to put into things, so I think there's always going to be a place for word vectors. It sounds like you're saying it's kind of an engineering problem, like how much do you want to throw at it? Right, well, yeah, and like a kind of marginal utility kind of thing, yeah: cost, size of models, etc. Yeah, got it, got it. We

talked a little bit about GloVe and embeddings; can you talk a little bit about attention, and the way you think back on that work, and the future of attention as a mechanism? Sure, yeah. So attention is the idea that in the neural network you're doing this calculation to find other places in your neural network where you have somewhat similar stuff, so it's a kind of content-based addressing, and then you're using information from those other places to help you decide what to do, what new representations to produce. So attention is really the big exciting new idea that appeared in modern neural networks, because a lot of the things that people started doing in the 2010s with neural networks are actually things that had been done long before; it's just people hadn't had the computing and data scale, and some technical ideas, to get them to work well. So, around 2013 to '16, in NLP the big model that everyone was using was LSTMs, but actually LSTMs were invented in the 1990s; and similarly, in vision everyone was using convolutional neural nets, but convolutional neural nets go back to the '70s. So it was really sort of all old stuff that was being reinvented, whereas attention was actually genuinely something new, this idea of doing this content-based lookup, and it really was a huge breakthrough that improved all of our NLP systems in the mid-2010s. So it was first invented for use in neural machine translation, but then it was quickly extended to be used

for question answering systems, summarization systems. Yeah, so the first version that was invented came from the University of Montreal, but then a second, simpler way of doing things, which we call bilinear attention but much of the rest of the world called multiplicative attention, was then developed by me and students at Stanford, and it's this simpler form of attention that was then the form of attention that came to be used in Transformers.
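For the flavor of that simpler form, here's a minimal numpy sketch of bilinear (multiplicative) attention, where each key is scored against the query through a learned matrix, score(q, k) = k · W · q; a sketch of the general shape of the computation, not the exact formulation from any particular paper:

```python
import numpy as np

def bilinear_attention(query, keys, values, W):
    """Bilinear (multiplicative) attention: score each key against the
    query through a learned matrix W, softmax the scores, and return
    the weighted sum of the values.
    Shapes: query (d,), keys and values (n, d), W (d, d)."""
    scores = keys @ W @ query               # (n,) one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over positions
    return weights @ values                 # blended context vector

# Toy usage with random parameters, just to show the shapes.
rng = np.random.default_rng(0)
d, n = 8, 5
out = bilinear_attention(rng.normal(size=d), rng.normal(size=(n, d)),
                         rng.normal(size=(n, d)), rng.normal(size=(d, d)))
print(out.shape)  # (8,)
```

The scaled dot-product attention inside Transformers is essentially this shape of computation, with the learned matrix absorbed into the query and key projections plus a scaling factor.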

And well, the original paper about the Transformer architecture was called "Attention Is All You Need", and it's not quite true that the only thing in a Transformer is attention, because they also have fully connected layers that are important, and they also have residual connections, which had been developed in vision and other places, and they're important. But nevertheless, the distinctive main item in a Transformer was making use of this new concept of attention even more extensively than it had been used previously. Mhm. And in what ways have attention mechanisms, and research

around the application of attention, evolved since the initial Transformer paper? So one answer is: kind of not as much as you might think. I mean, because in some sense the authors of the Transformer paper, people at Google, they were either really, really smart or really, really lucky, maybe some combination of both, but they really just seem to have nailed a good architecture, and I think to many people, including myself, it's just been surprising, the longevity of Transformers; it just seemed like this put all the right pieces together roughly right. You know, there have been tiny variants, in the sense of moving the layer norm down, but very small changes, and it's still what people are using. I mean, I think there are starting to be some new ideas; as people are wanting to deal with much longer context, there's interest in hierarchical versions of attention, so you can have sort of tree-structured attentions heading down, and so I think we're starting to see some new ideas. There's this funny thing, that for the last five years there hasn't been the same kind of innovation in models and architectures that there was for the decade before that; people have been trying to sell the story that we should just keep with what we have and make it bigger and bigger and bigger and we'll solve AI, which I don't actually believe myself. So what are you most excited

about? You know, A, is it solving AI, quote unquote, but then, B, what is your goal or kind of guiding star, and then what are the research efforts that you're finding most exciting to kind of help you get there? Sure, yeah. So there are lots of different things that one can do, but the two things I've been most interested in lately are: one, well, actually I am still interested in neural architectures, because I can't believe there'll be Transformers forever, and so I've been working with students on what are some architectural ideas to do things differently. But then, in the other direction, we're in this new world of the large language model, or its generalization to other forms of data, the foundation model world. And I think when people look back on things, rather than the age of neural networks being the big breakpoint, the big breakpoint will be when we entered the foundation model era, because that just gave this fundamentally new way of doing machine learning and AI systems. Right, before, it didn't matter if you were doing neural networks, or you were doing support vector machines, or you were doing regression trees or something else, it was the same recipe: you collected your data, you labeled your data, you trained your AI system. Whereas it's really in this modern era that we have this profoundly different thing of: you've got this large foundation model and you can just ask it what to do, or you can give it three examples for in-context learning, and suddenly it'll do things, right. So we're in this very different world.
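To make "three examples for in-context learning" concrete, here's an illustrative (entirely hypothetical) prompt; the task is never trained, only demonstrated, and the same frozen model picks up the pattern:

```python
# A three-shot prompt: three demonstrations, then a new case to complete.
prompt = """Review: The plot dragged terribly. -> negative
Review: A gorgeous, moving film. -> positive
Review: I checked my watch five times. -> negative
Review: I'd happily watch it again. ->"""

# Sent to a large language model, the continuation is typically "positive".
```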

But because that world is so different, very little of it has been explored so far. So just thinking about the pieces of that world: for almost all of them, people came up with some way that roughly worked, but there are ways to think of better ways of doing the different pieces of it, and for that ecosystem there are sort of lots of things that people haven't built yet, and we're going around building them. So if I stick with that bit first, a couple of the kinds of things that I've been doing with students there: well, if you have one of these

humongous models, they know a lot of facts of the world, but facts of the world keep changing: someone new is the prime minister or president, and you'd like to be able to sort of cheaply change the knowledge in a model. So how can you do edits for a few facts in the model in a way that's relatively cheap? So that's one area. And then a second area is that, you know, the difference between GPT-2 and GPT-3, which were very good at generating fluent, coherent text, and ChatGPT, which rocked the entire world and was appearing on national news broadcasts, came from the instruction tuning that allowed you to just tell it what to do, or ask a question and get a response. And so that was initially done by reinforcement learning from human feedback, RLHF. It sort of seemed like OpenAI had this clever

reinforcement learning guy, John Schulman, and he did it with PPO, proximal policy optimization, and so everyone else said, oh well, that's obviously the way to do that, we'll use PPO as well. And a bunch of students and us faculty were thinking about this, and really there was no reason you had to use PPO. And given the way that people are actually aligning these language models, from paired better-worse data that had been collected offline, it seemed like it was unnecessary and overly complicated and expensive in resources to do traditional reinforcement learning, which was

really built for sort of an agent learning online as it wandered around the world kind of thing. And so we came up with this alternative, direct preference optimization, DPO, which has been an enormous hit, because training it is much more stable, it has way less in the way of choices and hyperparameters that you need to fiddle with to get it to work well, and it's much less resource intensive, so you can run it without having huge computing resources. So at the moment, nearly all of the sort of medium-sized open-source language models are now being aligned, or instruction fine-tuned, using the DPO algorithm.
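For a sense of why DPO is so much simpler to run, here's a minimal PyTorch sketch of its loss (from Rafailov et al., 2023). It assumes you've already computed the summed token log-probabilities of each chosen and rejected response under the policy being trained and under a frozen reference model; the function and argument names are just illustrative:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Inputs are tensors of shape (batch,): summed log-probs per response.
    # How much more (log-)likely the policy makes each response,
    # relative to the frozen reference model.
    chosen_margin = pi_chosen_logp - ref_chosen_logp
    rejected_margin = pi_rejected_logp - ref_rejected_logp
    # Push the chosen response's margin above the rejected one's;
    # beta controls how hard the preference is enforced.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Compared with PPO-style RLHF, there is no separate reward model and no online sampling loop: it's a supervised-looking loss over the offline preference pairs, which is where the stability and cheapness come from.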

So that was a great research success, but I think in retrospect the biggest success isn't sort of the specifics of the DPO algorithm, but pointing out, hey, you can do this in other ways. Because since we published DPO, a whole bunch of people, different colleagues of mine at Stanford, people at DeepMind, people at Berkeley, right, there are sort of now at least five other alternatives to DPO and PPO that people have now used, and to a first approximation there are lots of other ways that work great as well. We sort of opened this jammed door by showing that you can think about this problem in different ways. Yeah, and so that's the kind of sense in which we're in this sort of new world and a lot of it hasn't been thought through very much. But if I just quickly revert back to the other direction, of architectural ideas:

yeah, I hope sort of prominent new alternative architectures will emerge. So one idea that I've been thinking about with students is trying to sort of have models that prefer a soft form of locality and hierarchy. Transformers just sort of look out everywhere for any associative signal that they can find; if there's any association that seems predictive, they'll just glom onto it with their attention. Whereas I think part of how humans learn quickly with little data is that they have a bias that normally things that affect each other are close together, not always, but say 95% of the time, and so we've been sort of exploring a couple of architectural ideas about how you might do that. You know, when you say close together, do you mean spatially in particular, drawing that analogy to humans? Okay, yeah, so does that imply some type of embodied setting? So that would be one way to explore that idea; to be honest, I haven't been doing that, we've been looking at it in the language case still. So close together means close together in terms of what is said or what is in the text, but I do think the same idea extends out and could be done in those other contexts; we just haven't actually done it. Okay. And so, to continue, can

you talk a little bit about how you have set up that research direction, and some of the experiments or promising directions that you're seeing there, in terms of leading to new architectures? Yeah, so the main thing that we've done there, which we called push-down layers, and this was sort of with my grad students, it's still related to the Transformer architecture, but we're adding this extra thin layer between the layers of the Transformer, and its role is to learn something about local structure. So it's sort of then giving a preference to have things stay close by, and we were able to show that this could give advantages in faster learning, and then better transfer to other different domains of data sets, because that was a useful inductive bias to work from. I guess I thought that proximity was a part of Transformers and why they

work already, and so I'm, yeah, looking for greater distinction between what you're describing. And yeah, so it's actually not so. Earlier models, like an LSTM model, were very much built so that things nearby affect each other, because it's sort of a sequence model, like even older models such as HMMs. But a Transformer really does attention anywhere inside its context, and there's no architectural reason to prefer to be influenced by the things nearby to you rather than things that are 173 positions to the right. But there's some degree, I guess, of proximity that's enforced by variables like batch sizes and things like that, right? So to the extent that you've got a certain context window, it has to be within that context window, but as time has gone by, Transformers now have very large context windows; right, it might be a 16,000-word context window, or some of the recent models are saying there's a 100,000-word context window, and within those huge context

windows, the model doesn't prefer local versus far away. You know, the model can learn that local is more useful, and Transformers do learn that local is more useful, but that's sort of part of why they need a trillion words of training data, to learn these things, whereas maybe learning could be made faster by giving some more biases into the model.
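As a generic illustration of what a soft locality bias can look like, here's a minimal numpy sketch; to be clear, this is not the push-down layers Chris describes (those add an extra layer between Transformer layers), it's closer in spirit to distance-penalty schemes like ALiBi, and all names here are illustrative:

```python
import numpy as np

def locally_biased_attention(q, K, V, alpha=0.1):
    """Dot-product attention over earlier positions, with a penalty
    that grows with distance from the current (last) position, so
    nearby tokens win unless a far token is strongly associated.
    Shapes: q (d,), K and V (n, d)."""
    n, d = K.shape
    scores = K @ q / np.sqrt(d)          # content-based scores
    distance = (n - 1) - np.arange(n)    # 0 at the current position
    scores = scores - alpha * distance   # soft preference for local
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax
    return w @ V

# With alpha=0 this is ordinary attention; larger alpha biases it local.
```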

Mhm. And what's the current state of that? I imagine that there's kind of validating it on a small scale, and then, since part of what's made Transformers so successful is demonstrating that you can scale them broadly and that scaling laws take over, how far have you gotten with that? Yeah, we're still at the validate-at-a-small-scale stage, I'll be honest. Yeah, super, super

interesting. You know, so before we wrap up, we've talked a little bit about some of your current areas of interest; I'd love to hear you riff kind of more broadly on, just once again, the way you think about the field and how it's evolved, and where you see everything going. Sure, yeah. So with the great success of large language models, there's a real sense in which actually there's so much we can do now with language understanding and generation, right. Like, in earlier decades, almost nothing worked in NLP, right; you'd try and get facts out of a piece of text, and you were lucky if you could get 50% of them right, whereas, things aren't perfect, but we've just got way better capabilities. And so it almost feels like, oh, we can do language understanding and generation pretty well now, and so the question is then, where do things head from there? And there are many possible directions; there are other ones, like multimodal models, where doing things between language and vision is certainly a good area, but if I mention the sort of two that I'm most interested in at the moment: one is that our current technologies of large language models really only work for major languages. They work great for English or Chinese or Spanish or German, and you can do reasonably well when you're then going to sort of something like, I don't know, Dutch or Czech, but there are thousands of languages in the world, and for the vast majority of them there just isn't much training data available. You know, some of them actually have millions of speakers, like some Indian languages, but the amount of written text available isn't in the billions of words, let alone the trillions of words, and then there are lots of smaller languages where there are only sort of small speech communities, and you'd be lucky to get millions of words of text available. So there are questions about how to extend some of our breakthrough methods to the rest of the world's languages, and there are interesting ideas there about how you can do transfer learning or exploit the commonalities of human language, or human language families, and so that's one interesting area. But

perhaps the most profound one is looking behind the veil of language, and these questions of reasoning and intelligence and how humans store their knowledge of the world. These questions are much more coming into focus because we have such good language understanding and generation that a lot of what we're missing is then the stuff behind that, and that's going to make all of the difference in giving us intelligent machines, but also intelligent language users, because a lot of being an intelligent language user is having that knowledge behind it, so that beyond being able to speak in fluent sentences, you're actually saying the right things in the fluent sentences. Mhm. And so what are examples of the things, you know, the stuff behind that? Well, so, better understandings of knowledge. So we touched on that earlier: although to some extent these models know a lot, they don't actually reason well about how facts fit together. So they'll at times say something that is right, but then they'll just say something completely wrong, right; you get this all the time. You sort of say, where did Christopher Manning get his PhD from? Stanford University. And then you'll just ask it differently, or ask something differently; you'll say, was Chris Manning's PhD from Berkeley? And it'll say yes, right. The models just don't have this coherent knowledge behind them, and to start working out better ways to get sort of consistent knowledge and world models behind our language facade, I think, is what's going to take us on some of the next steps towards artificial intelligence.

Mhm, and it sounds like you see that as more fundamental or foundational than, quote unquote, fixing hallucination; it's a fundamental shift in architecture or approach, or the way we think about building these models? Yeah, I think we still need major new developments in neural networks before we'll get there. Yeah, yeah. Awesome. Well, Chris, there's a lot worse that a researcher can do than being the one that kind of pushes the stuck doors open, so congrats on your success and all the impactful research that you've done, and looking forward to kind of keeping in touch and seeing more. Thanks a lot, Sam, it's been great talking to you. [Music]

[Applause] [Music] [Applause]

[Music]
