
Summary

Topics Covered

  • Industry has vastly more training data than academia
  • DPO democratized alignment research
  • Reward models are crucial yet poorly understood
  • Online methods are the frontier of alignment
  • Small model alignment remains underexplored

Full Transcript

Okay, well, welcome back to CS224N. It's welcome back for me too, since I was traveling for a couple of weeks; I hope everything went smoothly in the meantime. Today I'm delighted to introduce our first invited speaker, Nathan Lambert. Nathan did his PhD at UC Berkeley, so you're allowed to boo and hiss for that, but since then he worked first for a couple of years at Hugging Face, and now he's working at AI2, the Allen Institute for Artificial Intelligence, in Seattle. Nathan comes from a background in reinforcement learning, like quite a few other people who are now applying reinforcement learning to language models; he had an early background applying reinforcement learning to robots, but it turns out it's more fun to do it with language models ("no it's not"). Anyway, he's been very influential in developing ideas on how to do post-training with RLHF, and other ideas that have come since then, including DPO, which he'll definitely mention in today's talk. He's one of the best experts on the post-training phase of language model development, which has proven, as time has passed, to be where more and more of the action at the large language model companies is happening: not in the initial pre-training phase, but in this subsequent post-training phase. Nathan will have a lot to say about that today. Thanks a lot for coming to do this.

Yeah, thanks for the wonderful intro. You can see my talk is "Life after DPO," which is a little bit of an unclear title, so I apologize for that, but it's trying to

capture the moment that we're at in alignment and alignment research. DPO is really the story of the last year: this paper came out (I'll get to the math), and now a lot more people are interested in and able to do alignment, and the field is building on from there. So the question is: what are we going to be interested in after DPO? A tidbit from talking with Chris that isn't explicitly in my slides is the gap we're trying to close with the labs, like Meta. The amount of data they're using for this kind of post-training or fine-tuning (there are all these words, all defined later) is so big that the number of data points Meta bought for Llama 2 from one of these providers is much more than all of the data that's been collected on Chatbot Arena by LMSYS. Chatbot Arena has something like 800,000 data points collected, and Meta's Llama 2 paper says they bought about 1.5 million comparisons. Those are years outdated, while the Chatbot Arena number is as of a few weeks ago, so you can only imagine what OpenAI, Anthropic, etc. are buying at this scale. This is the kind of reality that we need to adapt to: what is different, given that we don't have that type of resource doing research, and what are we going to do? So this lecture is some history on things that led up to DPO that I saw and think are important to remember, and then we'll go zero to 100 and talk about recent research that we're doing to try to answer this question and

define what is happening.

So I'll start with a heavily abbreviated history of language models. I won't go through all of this; there's a bunch of it in the class already. I like to start with Claude Shannon, and then you skip a whole bunch of stuff to where this autoregressive loss function starts to show a lot of promise. This was not fast: you can see how many years it took to build language modeling as a field, with deep learning brewing in the background as one of many things that went into it. Then you have these years like 2017, with the Transformer paper that you hear about, and 2018, with GPT-1, ELMo, and BERT, these foundational topics in language processing and how embeddings are created. Then with GPT-2, scaling laws become a key idea that people are looking at and tracking as these models improve. And 2020 is when people really started to wake up to how useful these large-scale trained language models were. At this time I wasn't even a language modeling person, but for a lot of people in AI, this is when the gravity of the situation was starting to suck people in. There's a lot of cadence to these things: in 2021 we had the

stochastic parrots paper, which, before ChatGPT, was raising the warnings of: what are we actually putting into these models, and what are they learning? Are they actually learning something meaningful from language, or are they repeating the language that we have? This is a philosophical debate, depending on where you land on what language is and what these language models are doing today, but it's important that it came out before ChatGPT; these are the foundations of the debates about what language models are doing. Then at the end of 2022 ChatGPT actually came out, which was supposed to be a quiet launch of a demo from OpenAI, and it has since captured the attention of the world, as we have seen. The simple question is: can ChatGPT exist without RLHF? I think it's important to acknowledge that so much of this is from pre-training, but at every point down the line, in ChatGPT and a lot of the popular models since then, RLHF and these human-feedback-related or other fine-tuning technologies seem to be necessary but not sufficient: you need the pre-training, but you also need this RLHF, this post-training, to really shift the needle on what the most important models are at a given moment. You could list so many examples where RLHF has been relied upon. I like to look at these plots from the Anthropic Constitutional AI paper, where they show the iterative improvement of their different RLHF methods; it shows how you have these multiple model versions evolving over time as you add more fine-tuning data. It's a dense paper, but this is one of the most representative figures of what RLHF can do.

There's a lot of information in there that you don't need to follow right now. And then Meta's Llama 2 paper is pretty funny: they have this quote, that reinforcement learning, "known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness." This is from the technical report directly, which I find really entertaining. This is back in the day, when we were still saying "oh, we don't know if RLHF is really going to take off." This was July of 2023, in this building period, it's just directly from the report, and it has aged really well: people are still using this today. There are a lot of interesting hints about the history and culture of RLHF in the releases of these models, where the people at these companies like to talk about it and give us these cultural details about what's going on.

So I'm going to go through some definitions. I won't spend too much time doing RLHF 101 and exactly what is happening with the mathematical terms, but it's important to get on the same page about what some of these things do and don't mean. There are a lot of definitions; I think some of the

interesting ones, the ones to come back to if they don't make sense right now, are these. What's the difference between instruction fine-tuning and supervised fine-tuning? Instruction fine-tuning is what's become really popular: you're training a model to follow instructions (I have another slide on this after), while supervised fine-tuning is more of a domain-specific thing, and we want to do both of them. I think instruction fine-tuning is more linked to RLHF; it's about making these models really useful, really engaging, and easy to work with. Then there are other things like alignment, which is super vague, but it's in the word: it's training a model to be mirrored to what a user wants, and there are a lot of things you can align to. RLHF, which is a mouthful, is one specific tool for doing alignment, where you have this human feedback data. "Feedback" is a really loaded word there: there can be preferences, and learning-to-rank is related to actually putting feedback on preferences; there are a lot of little distinctions. I tried to make "preference fine-tuning" a phrase at one point but didn't really double down on it; I think it's a little clearer than RLHF, especially in the context of DPO. But there are just a lot of spheres that

are overlapping in this post-training or fine-tuning space of models these days.

Instruction fine-tuning is still the foundation of a lot of this. This is where things called system prompts are added, where we're making the model ready for a specific style of input. OpenAI is still innovating on this: they have this Model Spec document they released a few weeks ago, where they said they're going to have a second-level system prompt. This just adds some structure to how the models take in data, so that you can do a lot more of this fine-tuning down the line, and to how user data actually gets passed to the model, or how the developer passes information that the user doesn't see. What this can often look like is Stack Overflow or Reddit data, where you have a question at the top and then an answer, and I think this is still a lot of what is happening behind the scenes; there are a lot of Stack Overflow datasets out there, and Reddit has these data partnerships. This still uses the autoregressive loss function that we started with; we haven't branched out into different loss functions yet, but it's still super important. A lot of academic research shows that this is, in some ways, all you need, which I think is a much more mixed bag, but it's the simple method and the right place to start. And where we go from there is this RLHF

objective. This looks really familiar to people who are trained in reinforcement learning, though I think it's a little different from the NLP loss functions. On the left side is the standard reinforcement learning objective: you're learning a policy pi to maximize some reward, which is a function of something, depending on how you set up the problem. And on the right side is this KL constraint, a distance term so that the policy doesn't change too much. It's related to this whole idea of over-optimization that I won't go into much in this talk, but the key idea is that we want to optimize a reward without over-optimizing it. The primary questions when doing RLHF are: how do we implement a reward function, what is our reward actually going to be, and then how do we optimize it?
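Written out, the objective being described is usually something like the following (my notation, not necessarily the slide's exact symbols):

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

The first term is the standard RL piece, learning the policy $\pi_\theta$ to maximize the reward $r_\phi$; the second is the KL penalty keeping $\pi_\theta$ close to the reference model $\pi_{\mathrm{ref}}$, with $\beta$ trading off reward against drift. That trade-off is exactly the "optimize but don't over-optimize" tension.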

You see this abstracted later as: we train a specific reward model, and then we have specific policy updates; DPO, direct preference optimization, handles this a little differently. Before we get there, though: the actual preference model that people use for RLHF is, well, I find this interesting. It's the Bradley-Terry model, which is from economics in the 1950s, and is essentially a probability distribution over a pairwise choice. What ends up happening, for various technical reasons, is that if we train a preference model, it needs to output a scalar value, and by some coincidence that I think is still very convenient, they just take the output of this learned probability distribution as a reward. They say: okay, our reward is going to be proportional to this probability, and it's going to work, and it ends up doing so. But that's a big leap to accept. We have this pairwise preference probability, saying the probability that one answer is chosen over another, and then you take this crazy mental step of saying: we just pass in one piece of text, and we're getting the probability that that piece of text is chosen over any arbitrary other one. So there are a lot of assumptions that make this work, some deep concepts in here, but what we get out is a model that gives us a score.
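The Bradley-Terry model being described can be written down compactly. With a score (reward) function $r$, the probability that completion $y_1$ is preferred over $y_2$ for a prompt $x$ is:

```latex
P(y_1 \succ y_2 \mid x)
  = \frac{\exp\left( r(x, y_1) \right)}{\exp\left( r(x, y_1) \right) + \exp\left( r(x, y_2) \right)}
  = \sigma\!\left( r(x, y_1) - r(x, y_2) \right)
```

where $\sigma$ is the logistic sigmoid. This is why a model trained to match pairwise preference probabilities can be read out as a scalar reward: the pairwise probability only ever depends on differences of per-completion scores.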

And the question is: why do we have to do this? What if we can just take our original objective and use gradient ascent on that equation directly (ascent because it's a maximum)? This is really what DPO does. I'm blurring through a ton of math; it's a great paper for learning the math of language modeling, where you learn how the probabilities of different pieces of text are handled by the model, how it ends up being a lot of these log-probability ratios, and how the prompt and the completion are handled differently. It's worth digging into and understanding the derivation, but the core idea is: why can't we just do gradient descent, or gradient ascent, to solve the RLHF optimization? And it becomes incredibly simple. If you look at the code on the right, the reference code from the original implementation, it's extremely simple to implement, and it has this characteristic where, if you've worked with something like Transformers before, it's pretty easy to write a loss function that uses DPO, rather than building an entire infrastructure stack to start with. When you do something like PPO, the full RLHF stack that OpenAI uses, you normally need an almost entirely new infrastructure stack, but you can get started with DPO in a much, much simpler way. There are some characteristics I'll get to later, one being that DPO still has a reward model, which is really important.
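To give a flavor of how little code the loss itself needs, here is a pure-Python sketch of the DPO loss from the paper, not the reference implementation; the function name and the `beta` default are mine:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs of
    the chosen and rejected completions under the policy and the frozen
    reference model.

    The loss is -log(sigmoid(beta * (ratio_chosen - ratio_rejected))),
    which pushes the policy's log-prob ratio versus the reference up for
    the chosen completion and down for the rejected one.
    """
    ratio_chosen = policy_chosen_logp - ref_chosen_logp
    ratio_rejected = policy_rejected_logp - ref_rejected_logp
    logits = beta * (ratio_chosen - ratio_rejected)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))

# At initialization the policy equals the reference, so every pair
# gives exactly log(2); training should drive the loss below that.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # == log(2)
```

In a real training loop the four log-probs come from forward passes of the policy and reference models over the same batch, and the scalar feeds straight into ordinary autodiff, which is why it slots into a Transformers-style trainer so easily.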

For the math to actually check out, you're using your original language model as a different type of reward model, but that quickly takes us down a whole bunch of derivations, which is probably not the most fun lecture to give. The key thing, and why this lecture is called what it is, is that the first two points mean we'll see more DPO models than anything else. DPO is where everyone will start if they want to do alignment research, and for good reason: it is the right place to start if you're thinking about doing this. It scales more easily on compute, it's easier to debug, it's even easier to learn, so it's not really worth second-guessing that; it is a good place to start. But it also leads to these ridiculous conversations online where everyone is trying to figure out: is DPO better than other RL methods? PPO is this older, popular deep RL algorithm, which John Schulman wrote; REINFORCE is a slightly different parameterization of policy gradient, and they're very similar. DPO ends up just being simpler to work with, so there's this meme that if you just do gradient descent, it'll work. In reality, they're different loss functions doing very different things, but you can get similar results with both of them, which is why, if something is much easier to do, you should just start with it. I come back much later in the talk to what is fundamentally different about these RL algorithms, how your data is processed, and where the signals actually come from. But for now, we don't need to say one versus the other; we can do both, and they

are different.

So that's the quick 101 of the core ideas. Now I'm going to take a path through how we actually got to training models with DPO (I think this slide was from a different talk that this subsection is reduced from), because DPO really came out months before we started getting popular models trained with it. How did we actually get to the point where the community was training models with DPO, which happened much more recently than the paper's release? This comes all the way back to those first instruction-tuned models that you saw: the Alpaca, Vicuna, Koala, and Dolly models of the world, all in April of 2023. These are all built on similar things and slight iterations: figuring out how to use synthetic data, building on the first LLaMA release, and some other things that I'll talk about. This is where we started: they're all using instruction tuning, and most of them use synthetic data. What Vicuna actually did was use this thing called ShareGPT, which was the first time people working in this academic alignment space had access to data that was from humans. It ended up being a bit of a legal gray area, because it was logging data from people using a Google Chrome extension, called ShareGPT, that gave ChatGPT a share button. But this data was really important to things like Vicuna and a lot of the other models that came down the line, and it's still used in models today as one subset of the training dataset. Just having access to these human prompts unlocked a lot of potential back in the day, and it's still

something that we're seeing. Thankfully, now we're starting to get datasets like this that were collected in more permissive ways: the LMSYS data has prompts that are collected with consent, and WildChat, which was a project from AI2, essentially gave people free access to ChatGPT in exchange for their data. The thing that came after ShareGPT was the realization that we need more human data, and this Open Assistant project is one that, honestly, we need more of; that we haven't seen more things like it shows how hard it is to create human data. It was run by a few people in a Discord community working extremely long hours to generate prompts, responses, and preference pairs for common requests to language models. This was from April of 2023, and we haven't seen anything like it since. ShareGPT or the LMSYS data is similar, but there isn't the same level of controls and voting and ranking that went into this Open Assistant data. It, again, is a dataset that we're still training models with, and many people still train models with it; it comes up time and time again. So these one or two influential datasets from over a year ago are still what are used to train models; you'll get the theme as I keep going. There are actually RLHF models

trained in April of 2023 as well. This was from Carper AI, which was doing a lot of work in the space (they've fallen back a bit in recent times), but there were people doing methods similar to what I'm going to talk about at the end of the talk. That knowledge and infrastructure was not translated into things that were easy to use, so there's also this vein of: even if things are open, it doesn't mean they're going to immediately catch on and be useful. You have to have the resources, the data, and your codebase set up in a way that people can build on, which is what DPO did really well. This RLHF model from Carper was successful, it was better than the Vicuna model, but no one really built on it right away, which I always find confusing.

Then, later in the year, another key thing for open alignment was the Llama 2 backlash, where Llama 2, asked to kill a Linux process, would refuse. This bred a whole series of models which are still referred to as "uncensored," which I don't think is the best name, because I don't think there was ever actually any intentional censoring of the model. But the goal is to make models that don't refuse any request, which is useful as a research artifact: what do you get out of a model if it answers

every question, and what are the limits in that regard? There are other ways to use that, which are up to you. What ended up happening is that a lot of these ShareGPT datasets, because they're from ChatGPT, contain data that says "oh, as a language model, I shouldn't answer that," so people started filtering all of that out, and you still see a lot of people releasing these uncensored models today as a popular area of development. I think we should understand what people need when doing research, and researching a model that doesn't refuse is reasonable, but if you're going to deploy a model for free use by users, you should consider whether or not everything should be answered. As a researcher, how your artifacts are used should factor into the work you're actually going to be doing.

Then, later in the alignment timeline (I'm almost done with this long lens), there's a long series of models that are really interesting to people like me but never really broke through the narrative. They're saying things like "we used RLHF," or they were the first model to beat GPT-4 on AlpacaEval and these other eval tools. They're scaling things up, but they don't always have papers, they don't always have codebases, and things are happening all around; it's not just the Hugging Faces of the world. There are a lot of different organizations, in the US and elsewhere, that were aligning models and getting similar numbers to, or beating, these mainstream tech companies, the places you normally look to for models.

These are all in the summer of 2023, and I bring them up because they come before the first big splash of DPO. This Zephyr model was really the first model I remember making a splash with DPO, and it took until this time, September, after the May release of the paper, for people to really say: oh, DPO is the real deal. It took four months. Now the paper has a best paper award, everyone uses it, there are tons of derivations, but in industry, among people trying to train models, there was a lot of skepticism until this moment. So it's a classic academic story of needing to wait a bit until your work is vindicated in some ways. The two crucial things here were, first, a new dataset, the UltraFeedback dataset, which is a dataset of synthetically generated text labeled by GPT-4, so again these new ways of making data; it's a preference dataset, and we didn't make it, it was made by OpenBMB, who I think are based in China. And second, we also just had to do a lot of experiments to make it work. There's a weirdly low learning rate that was needed to make this kind of chat model work with DPO, around 5e-7; if you're really plugged into AI, you'll know that 3e-4 is the lore of the best learning rate, so this is orders of magnitude lower. That's what it took to get this to work; we probably could have done it months earlier if we had just done more hyperparameter sweeps. But this is the random happenstance of the stories that people now backcast as "this was the super-important model." It's just

somewhat random. At the same time, I was switching jobs to the Allen Institute, and they were already working on a project trying to do a systematic study of instruction tuning data, along with some of the preference tuning recipes that were coming out. Because once this Zephyr model came out, there were always skeptics saying: oh, doing it at 7B is easy, that's a small model; is it actually going to scale to the real deal, to bigger models, to what ChatGPT does? So, okay, we had some more compute, we tried it at the 70-billion-parameter scale, and we showed similar gains. All we did was use the same UltraFeedback recipe with the low learning rate, and it largely worked. This was within two months, and since then there have been tons of new DPO models: all these startups releasing their own models will release an instruct version that is a DPO model, and that continued for six months. I think just now I'm starting to see fewer DPO models, which is interesting; I've been keeping track of them for another evaluation project, and it has finally slowed down a little. I don't know if that reflects alignment at large, but there are so many that I should add a slide listing the ridiculous number of DPO models that came after these two. But this is really when the floodgates started, and

when we said: okay, DPO really works. This is why I ask what comes next. We could retrain models on the datasets that we have (we don't have that many datasets), but it kind of feels like we're fishing in the dark: Zephyr was built on the luck of finding the low learning rate, and this Tulu 2 model is actually trained on TPUs, because we have the Google TPU Research Cloud, so we have bigger TPUs to train these models. So how do we do this more systematically? That's where most of what I talk about today on the technical side comes in: the recent research we've been doing to make sense of this and answer the fundamental questions, like what do we need to change about DPO, is PPO better, and so on. This is the reality I go back and forth between: we don't really have the human data to do RLHF like industry, but it is getting much easier to do alignment research. You can choose your narrative; I think sometimes, because I'm so close to industry and hear what people have, I'm too often on the pessimistic side, but there is a lot of opportunity to do things. It feels crowded, but being crowded at this point, when there's so much investment, just means you're in the right area, and most people in this room aren't trying to be professors, so if you get scooped, it's okay. I find it very fun. So: how do we actually understand what we're doing with alignment, and can we improve on these models? Tulu 2 has a number in its name because we want to keep releasing more models, so how do we get

better at evaluating what we're doing, to try to understand this process, and then how do we train better models? These are the sorts of things I'm up to, and I have a few examples of things I've been working on. I built an evaluation tool for reward models, and I'll talk more about reward models to start here. We need better evaluation because, when you're training models, you need to be able to do what I call local evaluation: you need to be able to get a number that tells you whether your training technique is improving the end result. You can't wait until Chatbot Arena evaluates your model, because that takes about a month to get your numbers back; you need to be able to run something at your desk that gives you signal on whether you're actually doing a good job. We're still pretty behind on those evaluation tools, though more are coming, which is promising. And then, given DPO's simplicity, can we actually improve on it, and can we catch up to the industry rumors that they've let it drift aside?

So RewardBench is this project that I started because there were no evaluation tools for reward models. My motivation was mostly transparency, given how much industry says reward models are what you need to focus on, that they're really important for getting good models out the door. And it's like: what does that mean?

What does it mean for a reward model to be good? If we look at this feedback diagram, the one homage to the RL background, feedback loops: the reward model is here, the agent is your actual language model, pi is the policy, and the training data is the prompts that you get. In this RLHF framework you have a feedback loop where the policy generates something a, which is the action, which is the completion; it goes to the reward model, which then scores it. But off to the side you're looking at all these evaluation tools, and none of them are giving us internal insight into what's happening in this feedback loop; they seem external to what we're doing when we're training these models. So we really wanted to zoom in on this reward model. And reward models are trained in another

kind of weird way, one of the many quirks of RLHF. In order to train a reward model, you need to collect this pairwise preference data. If you use ChatGPT a lot, you'll sometimes see it give you two answers and ask you which one is better; that data is literally what is used to train a reward model. It's a prompt and then two completions, a chosen completion and a rejected completion. But in order to train these models, you have to pass both of them in at the same time, and it gives you two scalar values; you use a language model that outputs a scalar, just by some modifications to the last layers, rather than outputting text. And then this loss function, which I'll show on the next slide, is essentially why you need this batch-mode idea, where you pass multiple things in at once and get multiple numbers out. In the loss function, r is the output directly from the reward model for the rejected completion and for the chosen completion, and you're trying to separate the distance between them; automatic differentiation updates the parameters so that this distance gets bigger. So you can't just do supervised learning directly on one example for the reward model (there are alignment methods researching that now); it's really built on this idea of separating two things to create a margin in the preferences and learn a decision boundary.
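That pairwise loss on the two scalar outputs reduces to a few lines. A minimal sketch (the function name is mine, matching the negative log-sigmoid form commonly used in the RLHF literature):

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry-style loss on the two scalar rewards the model
    produced for the chosen and rejected completions of one prompt.

    Minimizing -log(sigmoid(r_chosen - r_rejected)) pushes the chosen
    score above the rejected one, widening the margin between them.
    """
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

# Correctly separated pairs cost little; inverted pairs cost a lot.
good = pairwise_reward_loss(2.0, -1.0)   # small loss
bad = pairwise_reward_loss(-1.0, 2.0)    # large loss
```

This is why both completions must go through the model in the same batch: the loss is only defined on the difference of their two scores, never on one score alone.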

There are a lot of really specific details in industry, such as: these models are only trained for one epoch, and they get really low accuracy scores when you compare them to other train/test setups in machine learning. There are some additional tweaks people do: you can do ensembles, and Llama did this weird margin loss, but none of it is really transformative in how these models are trained. They're in this weird place where you can only get about 70% agreement with your annotators. It's the sort of question of: is the noise part of the signal, or is it a bug? In preferences, it could make sense that it's signal, because not everyone's preferences here are the same, so not getting full agreement might mean the system is working; we don't want ChatGPT to be fully narrow-minded all the time. And this leads to the thing I was talking about: how do we actually evaluate these reward models? I hear all the time that reward models are crucial to RLHF, but how do we know exactly what aspects of the final policy they're improving? Should we include safety in these reward models? How do scaling laws impact reward models? These are basic machine learning questions: can we evaluate these models, and what should we think

about so what we kind of what we did is we collected a bunch of prompts and then we manually created Chosen and rejected answers for each prompt and then we can see whether or not the reward model

agrees with our human created data and call that like a win or loss in an accurate point of view it's really direct we're just doing inference on existing models and we're going to see

whether or not they agree with human data and this is a slide if you want to go into the academic side of things this was built on a lot of existing evaluation tools that were out there
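As a sketch, the accuracy computation he describes is just counting how often the reward model scores the human-chosen completion above the rejected one. The function name and the score values below are hypothetical, purely to show the mechanics, not the benchmark's real API.

```python
def rewardbench_accuracy(scores):
    """scores: list of (reward_on_chosen, reward_on_rejected) pairs, one per
    prompt, produced by running inference with the reward model. A 'win' is
    when the human-chosen completion gets the higher reward; accuracy is the
    fraction of wins."""
    wins = sum(1 for chosen, rejected in scores if chosen > rejected)
    return wins / len(scores)

# Hypothetical rewards for three (prompt, chosen, rejected) triples:
# the model gets two of the three pairs right.
print(rewardbench_accuracy([(1.3, -0.2), (0.1, 0.4), (2.0, 1.9)]))
```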

you'll see some common names AlpacaEval and MT-Bench are things that you've heard about XSTest was on the slide when I mentioned Llama 2 being overly safe and there are some other things that are really good but you might not have heard about like this LLMBar data set from Princeton which is a bunch of trick questions that I'll have an example of later and some kind of normal names from Anthropic

and open AI in here as well so there's a lot of different things that we're testing with this data set and then we're trying to get the full picture of like what is going on with these

models we released this in March of 24 and you can see a key in the bottom where these kind of um red circles with the arrow in them are DPO models which you can use as a reward model

and then these dice icons which look like gray squares when you zoom out are what I described in this kind of classifier type of training and you can see that there are reasonable

scores the benchmark isn't saturated there are a bunch of open models some names that you've seen before like the Tülu models and the Zephyr models are on here kind of normal stuff this is what we expected it's not too saturated but if you look here I'll show you where this model has moved in a few months so today we have a lot more models and there's a lot more information here so I get to tell you

about more interesting things which is how OpenAI's and Cohere's models do on this which I mentioned wanting to do for transparency but we also added new types of models so this is where the fifth model ended up in two months the model that was fifth on the leaderboard is now 31st we're getting the saturation from people doing research in the area so they actually have places to compare their models but we also have models from some closed labs and I'll get into the details here some of these are labeled as different types of models one of which

is LLM-as-a-judge LLM-as-a-judge is the idea that you can ask a language model which answer is better this is kind of how things like AlpacaEval and MT-Bench

are built but you can also use that as a reward model I told you that I have prompts and then chosen and rejected I could just ask ChatGPT which one is better and see what it does and this is what we added in as a baseline and this ends up being really interesting because GPT-4 and GPT-4o are not actually as good in this closed domain as a reward model that Cohere is training so we don't have full information because we don't have OpenAI's reward models but we can use their models to compare so we have a lot of different information going into one system about how language models and different parts of the

alignment process choose different categories so if I go back you can see Cohere here across two different months theirs has improved a lot and then these kind of earlier DPO models that we saw higher up on the leaderboard have been shifting down as more people train reward models to begin with and the specific category that I'll

focus most on is this kind of Chat Hard thing if you think about evaluation a lot a surprisingly common topic in tech coverage is how evaluations are saturating this is the one part of our benchmark that hasn't fully saturated and it's really important to giving some sort of longevity to the benchmark and I'll talk

more about this kind of as we go from here so I mentioned this data set and it's interesting to understand if you could actually do this problem so what

we have is a prompt a Chosen and a rejected and the prompt is give an example of a metaphor that uses the following object stars and then the Chosen and rejected are two similar

metaphors but you can see if you read these what the differences are I'm just pausing for the people that are paying attention to read these but essentially what happens is that the chosen one is a metaphor about the stars the twinkling diamonds in the sky as the prompt asks where the rejected is about the moon which is also in the sky at night and this data set is a whole bunch of things like this where what they do to create this is they either manually or

by ChatGPT rephrase a prompt and then create a new generation from it so you can get these rejected generations that are just off topic and it makes sense as

something that would be really hard for language models because they have this association between the stars and the moon but we want our language models to be able to answer questions like this and this is the type of thing where our reward model benchmark has the best signal on something that remains hard so this is promising this is the sort of thing that if you're in research is interesting it's really in the weeds but it shows that we still have things to learn about these models and there are things that we can't do yet but

another interesting pattern is in safety I mentioned those kind of uncensored models and in safety we see all the patterns we would expect the breakdown at the top of this table refusals is

things that we want the language model to refuse and then this XSTest data set can be split into things that we want models to refuse and things we want models to respond to and you can kind of see that

there's multiple categories of either DPO models or reward models where the model that handles safety really well refuses things like asking for advice on

causing harm and responds to something that is borderline but there's actually a lot of models out there that just refuse everything which is kind of the safe bet but that'll tank your score on the things models should respond to

we were seeing a lot of tech companies release models like this and it just doesn't feel right when you talk to them but there's also the models that just respond to everything the philosophy there is it's not the language model's job to gate the question which is something that we hear a lot about in the discourse of alignment

but to see it in these reward models and DPO models when directly probing them without asking them to generate text is nice to confirm a lot of

suspicions that we have so this is back to some of the DPO math which is again good to know so if you are to go into the DPO paper you'll see equation three

here which is the reward that is defined in order to make the math actually work and this is very different than just outputting a scalar it ends up being a ratio of the probability of the policy relative to the original policy during

training which is called the reference model and it's a fairly complicated mathematical representation so if you actually take a piece of text and pass it through a DPO

model the reward will be something like minus 200 because it's a sum of log probabilities probabilities are between 0 and 1 you take the log you get negative numbers and you sum all of these up so you get a big negative number and that is intuitively the score that these models are providing which is very different than the other type of reward models that I talked

about training earlier and if you have a prompt with a chosen and a rejected equation 4 is the math that you actually need to do to decide whether or not

one of the answers was better you're kind of comparing these ratios of probabilities with respect to this reference model which was the starting point of training and when people release a DPO model they normally release just the final model and not all the intermediate checkpoints so this reference model would be an intermediate checkpoint in the training process so the question is can you use it as a reward model if you don't have access to all the information and the short answer is no all the scores on our benchmark plummet across all the DPO models that we have
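A toy sketch of the two computations he just described, assuming made-up per-token log-probabilities: the equation-3 style implicit reward is beta times the difference of summed log-probs between the policy and the reference model, and the equation-4 style comparison just asks which completion gets the higher implicit reward.

```python
def dpo_implicit_reward(policy_logprobs, ref_logprobs, beta=0.1):
    """Equation-3 style DPO reward: beta * (sum log pi(y|x) - sum log pi_ref(y|x)).
    Each argument is a list of per-token log-probabilities for one completion."""
    return beta * (sum(policy_logprobs) - sum(ref_logprobs))

# Hypothetical per-token log-probs for a chosen and a rejected completion.
chosen_pi,   chosen_ref   = [-1.0, -2.0, -1.5], [-1.5, -2.5, -2.0]
rejected_pi, rejected_ref = [-0.5, -0.5, -0.5], [-0.4, -0.4, -0.4]

# With the reference model, the chosen completion wins...
with_ref = dpo_implicit_reward(chosen_pi, chosen_ref) > dpo_implicit_reward(rejected_pi, rejected_ref)
# ...but scoring "reference-free" (just summed policy log-probs) flips the
# decision, which is roughly why the benchmark scores plummet when the
# reference checkpoint is unavailable.
without_ref = sum(chosen_pi) > sum(rejected_pi)
print(with_ref, without_ref)
```

The numbers are invented to make the flip happen, but the mechanism is the point: the reference model normalizes away how inherently likely a completion is.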

that drop makes sense because this extra model is a regularizer on the probabilities it's in the actual reward equation if you go back a few slides so what we do is we get rid of it and stop normalizing equation 4 and just see if it works and it doesn't but this is important because DPO is

training a reward model but if we don't always have access to it we just can't learn from it or use it in another system cleanly so it's a lot to ask for people to

release models and this is an interesting slide showing Cohere's progress on reward models in just a few months they released something that was clearly state-of-the-art on our benchmark then an alignment lab released something in May and then just a few days later Cohere sent another number here's our new model it's still better than everyone else so it's nice to have this academic-industry intersection but it's very rare and takes a lot of work in terms of networking and building relationships but we're trying to do it at least in these small niches where the companies are willing to share RewardBench 2 is going to need to

just mostly make everything harder and make everything more human and the last point which is what I'm going to transition into next is that everything I've told you about is about part of this RLHF

pipeline but I haven't told you how it is impacting the final model that you use at the end of the day which is a very rightful criticism if you're evaluating part of the alignment pipeline you should be telling me

whether or not the final model is actually useful so this is where I talk about our journey into trying to train PPO models we're trying to fine-tune a good model we spent a lot of time on DPO with this Tülu 2 work and we wanted to know if we could do better by switching to PPO this is not-yet-published work but it's going to be out soon so the numbers aren't entirely final but we're just trying to disentangle what the difference between DPO and PPO is at a very empirical level so we're trying to answer if it's

better or not so what we're going to do is kind of walk through a series of design decisions and see how it affects the suite of evaluations we're starting with this llama 2 13B model and that has

already been instruction tuned the difference between the blue and the red is the gains from instruction tuning for these kind of um reasoning coding chat tasks instruction tuning does the biggest Delta that you'll see among all

these slides instruction tuning kind of puts the model on the map as being useful and it is easy to see gains at the beginning and then it's harder and harder for us to really keep improving

these models so what we start with is we add this Anthropic helpful-harmless RLHF data with DPO and you can see that there is a small bump across all the metrics that we did this data set is known as being particularly noisy among researchers in the area but it is kind of the starting point when you're doing research on alignment it's been around for a few years it's big it's multi-turn but it's known to be noisy and it still gives improvement and then if we switch to this data that was used for both Zephyr and Tülu 2 officially this UltraFeedback data

we get an even bigger bump so this is just kind of showing the difference that changing only the data can give you in a DPO recipe it's normally increases of

kind of like 0 to 2% and in the research sphere of trying to ship a model that's a big deal so this is where we ventured into new territory grad students worked really hard and implemented PPO in JAX in addition to what they already had and we were like okay what happens when we add PPO reliably across multiple experiments this is one example with the 13 billion parameters PPO just happens to do a little bit better it's like 1%

better and we try to change a lot of things and the changing of things is where things get a bit messier so we've heard from industry that using a bigger reward model can be really helpful to getting a better policy model essentially these bigger reward models will be better at nuance they should give better scores which are

used as Rewards they should just kind of make this process a little bit more stable if we have the compute for it we see that it does improve some things but it doesn't actually make the model

overall much better it's kind of flatlined with pretty similar data after just making the reward model bigger which is a little bit surprising to us and this is the most realistic few slides of the talk we did this thing where we were even trying to see if our reward model training was bad as we scaled it up so

we used RewardBench on the right which I had told you about earlier and it's not clearly correlated whether these 13B or 70B models are better we also did this best-of-n sampling idea which is if you generate a bunch of completions from the language model you can rank them by your reward model and then evaluate on the top-ranked completions that shows that our reward models are better at the bigger scale but we couldn't get this to really click into a downstream model in a PPO version of the world we even tried adding more prompts to RLHF we added more

code and reasoning prompts because that's something that OpenAI talks about a lot and we want to improve our models on it doesn't really shift the needle on this kind of cohesive average over many tasks what you'll see in the paper when it's out is that we added prompts really similar to the math and code evaluations and those specific evaluations got a bit better but with the added noise some other evaluations might go down this process is really hard to disentangle and this is why we're getting the 0 to 2% improvement out of PPO but DPO doesn't have this sort of mess so what we ended up getting to is that there's always one more thing

for us to ablate when you're training these models with PPO things like different regularization we're learning a value function in RL different warmup different sizes there are just so many knobs to turn in PPO and it was reliably getting us a pretty good model but it's like we're staring into the abyss trying to improve this right now in the next few months and the bottleneck in terms of the actual technical side is that PPO generates new responses from the model as it trains to kind of refresh the data and that is by far and away the biggest bottleneck when you're actually training these models it's just way slower than DPO all these resources for PPO are somewhat available to academics the Google TPU Research Cloud I think is pretty available the grad students I work with seem to sign up and the code base is open so if you're a grad student interested in trying to do PPO alignment and have access to TPUs please get in touch it's a very fun

can of worms but as a summary these are the many different DPO data sets that we tried this is almost all of the well-received data sets that are out

there in the open and if you look at the factuality column some of these things just don't matter at all when you're aligning these models so we need to get new data sets that

are really adding different capabilities to these models and something that matches these kind of UltraFeedback numbers at the bottom I'm surprised whenever I look at this but this is where we are at and we need to try to keep building data sets and keep adding freshness to this system UltraFeedback at this point is maybe 6 months old or so I don't know the exact age but in terms of people training models that feels old relative to things that are happening and these are the actual sort of numbers

that you get when you compare DPO versus PPO this is all with this 13 billion parameter model again we changed the data set and every one of these PPO runs comes out a little bit better on average and this is a few grad students and people like me this is not a big team in industry doing this we're scraping by and I don't know if it's worth the effort I see why OpenAI uses this because we were able to get a bit more signal out of it but it's a ton of effort to get a bit better

signal out and I'll transition into a bit more of an open-ended discussion of this and then we'll have questions but it's like what about PPO is actually special like this generation and this online nature and can we just change DPO to be like this or where are the new things going to go and I had the pleasure of advising one

project that was related to this but this is much much more General so it's like what is special about online data there's multiple ways that you can get new data into your

RLHF process and then there's also this related question in the reinforcement learning literature which is on-policy versus off-policy which is a technical distinction that often gets looped in with these discussions of DPO versus PPO they're actually related but the reinforcement learning discussions have a much more definitional flavor to them while in this alignment space we're more focused on if we need to get fresh data in and how we need to label our data for language models so

I'd make this distinction between these two things the first is freshly generated data from the policy if you zoom into a data set like UltraFeedback it has generations from all sorts of models from Alpaca Vicuna GPT-3.5 GPT-4 Llama there are generations from all sorts of models in this data set we are using so when we train these Zephyr and Tülu models we're incorporating information

from a lot of different models down into our one policy whereas what PPO is doing is only generating data from your existing model and kind of changing this distribution over time so like that is a

very different idea of where the signal is coming from and then the second thing is whether or not you refresh the data labels over time if I have human labelers comparing chosen and rejected that's one data point but I can also later on take this reward model that I trained and relabel a chosen and rejected so these two things what the actual text is and when the chosen-rejected label was given are what people mean when they're talking about whether something is special about online in RLHF and it's clear to see

that PPO does it very differently than DPO but we're not restricted to this in the last few weeks I have the dates all in here in April and May of 2024 there started to be a lot of papers on this about DPO PPO online offline and they really say

similar things which is that online is important and these papers on this slide they show these kind of more theoretical and closed form experiments on like what is special about online data and what

performance drops if you use this kind of offline data it's good to dig into these but this is why I say it's nice to do research now because if you have an idea a lot of times people have like three papers that confirm the

notion that you have it's a lot easier to be confident in things if three independent institutions say something similar at the same time there's a lot of methods coming out where people are

trying to modify DPO to actually use this kind of online notion I think self-rewarding language models from Meta was the first really popular one where they asked the DPO model hey which of these answers is better in between each iteration so they did this LLM-as-a-judge to relabel their own data and then they did multiple iterations of DPO and the model had

really strong scores there are now ideas like not using all of your data at once so you can do batches of DPO and update your data the paper that I was on with this discriminator-guided DPO which I'll talk about in a second is using reward models plus this DPO training objective there are just a lot of things that we can change and I think the community again is in this expansion phase where I even get messages from people who are like oh my paper was really similar to this other paper we did it first they didn't cite us and I'm like this is kind of the point it's going to be like this for a little bit longer and

then hopefully by the end of the year or in a few years we're going to be like okay this is clearly what we need to do on the method side of things so this is one example D2PO discriminator-guided DPO which I was an advisor to led by an undergrad researcher and the idea is comparing these three different things

so A is the standard DPO you have a data set and you apply the loss function on it B is what we call some sort of online preference optimization which is where you can repeatedly label your data with a reward model a bit like the self-rewarding paper that I mentioned you can reshuffle your preference data based on a reward model
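A minimal sketch of that relabeling step for option B: between DPO iterations, reassign which completion is chosen based on the reward model's current scores. Here `reward_fn` stands in for the trained reward model, and the length-based toy reward is purely illustrative, not anything the paper actually uses.

```python
def relabel_preferences(pairs, reward_fn):
    """For each (completion_a, completion_b) pair, make the higher-scoring
    completion the 'chosen' one and the other the 'rejected' one. Running
    this between DPO iterations is the reshuffling idea described above."""
    relabeled = []
    for a, b in pairs:
        if reward_fn(a) >= reward_fn(b):
            relabeled.append((a, b))  # a becomes (chosen, rejected)
        else:
            relabeled.append((b, a))  # b becomes chosen
    return relabeled

# Toy reward model that just prefers longer completions (illustrative only).
pairs = [("short answer", "a much longer detailed answer"), ("yes", "no")]
print(relabel_preferences(pairs, len))
```

The DPO loss is then applied to the relabeled pairs, which injects some "online" signal without generating any new text.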

and that kind of adds some notion of online to your data and then the third thing is what if we're relabeling data and retraining our reward model over time so we're really trying to keep what our policy is doing related to our reward model and keep everything updated in real time so that it's all lined up and this is asking how much

of a gain do you have by retraining the reward model over time in a DPO framework and part of why I like this paper is there's things like closed form tasks so

the biggest question that I get for alignment is like how do we actually evaluate it like what tasks is it good for there's a whole philosophical discussion where I think information transformation is a valuable task

writers tell the same stories in different ways but the best-told story is the one that resonates with people that has value but at the same time we're academics and we need to be able

to measure things so this paper has things like your reward is counting the number of nouns in a sentence and then you're using these alignment methods to increase the number of nouns in the output sentences from the model so you can measure that a lot better because we have classifiers which know what nouns are and you can see in this left figure that just by retraining this reward model a few times it converges

better than if you were just to relabel your preference data it's a mouthful but keeping your training process a little bit more online can improve performance and on the right is a more standard open-ended evaluation task where we're asking a language model like ChatGPT which answer is better and that has all sorts of

problems but we can show similar results I think the big takeaway is really these few slides which is that the literature is moving we have studies that show that online is better and people are coming up with really cool clever ways to actually use online data so combined with new data sets this is kind of the theme of this year online methods and how

they work so this kind of goes back to what industry is doing and I showed this figure earlier on the left with Claude where you can see the little points along the lines and these are these

different iterations we don't know exactly what they're doing but it seems a little bit different where the dots on these figures are new data sets from humans rather than this kind of redo a

reward model and relabel your data this is what happens when you have access to a different type of scale the Llama 2 paper makes this much clearer they say they work with an annotator they get batches of data and when they're generating this new batch of data the previous model's checkpoint was used for generations they do this many times and you can see that they're collecting new human data again and again and each time they collect human data a new model is trained they're doing a lot of training updates and they're kind of building on each other and this kind of

leads into the last section that I'll talk about in the conclusion which is what did Meta do with Llama 3 this is one of the funniest blog post sentences the ridiculous

things that they give us and then we parse the tea leaves they say in the blog post that our approach to post-training is a combination of supervised fine-tuning rejection sampling proximal policy optimization PPO and direct preference optimization people ask me what the heck did they do and I mean I kind of agree but it really goes back to this slide in my

mind which is that they're getting new data and then they're training a new model over time so what I think is happening at each one of these points is they tried a few methods and they chose the training method that worked best it's practical Meta is a really practical organization especially in the GenAI org right now and that just makes sense at different points in training

your model has different capabilities and it's ready to be trained in different ways rejection sampling which I didn't cover here is the simplest Training Method you take a reward model

you rank some supervised fine-tuning outputs and then you use this autoregressive loss function again and then from there DPO is much simpler than PPO but it might not give you the highest end performance and then as your model really starts kicking into gear or you have more time to train this model once all of your data is collected and you're not on a weekly time crunch you can

experiment with all the little knobs of PPO and you can really try to get the best model out at the end of the day hopefully they release a technical report that confirms some of my hypotheses but I think this is normally what people are interested in when somebody from industry comes up to give a lecture and I wish we

had more details on what industry was doing but in terms of current directions that I'm most interested in for RLHF I talked about data a lot we are very bottlenecked on data even as academics with very limited compute we literally try every data set that is available we don't have a lot of compute but we need to keep

innovating there we're going to see more DPO methods it's here to stay there's a ton I didn't cover here things like removing the reference model changing the loss function slightly not using pairwise preferences but pointwise preferences there's a lot going on there we should use more model sizes than 7 and 13 billion parameters or in Llama's case 7 and 70 billion parameters particularly scaling down is very useful it's a place where academia

can still play there's kind of less of a weird marketing Dynamic where all the companies are racing to go bigger for certain um strategic reasons but this is something that's accessible to many

people aligning small models it's hard to get signal out of them because the models show more or less random scores on many benchmarks that people care about or really low scores so even just

kind of breaking through in that domain would be really impactful work to kind of get more people working on alignment and then kind of evaluations I covered at length which is we need to keep

getting more specific on things we care about and personalization is something in alignment that I didn't cover in this talk but is a good way to compete with this kind of big tech which is how do we train models that are good for you as an individual rather than one big model for one big technology organization so these slides will

get to you but these are the types of places that I follow when I'm trying to see open models or open data sets that are reputable and easy to keep track of so you don't have to try to follow um

everyone and I write about this a lot without doing too much self-promotion but I ended like 10 minutes early for questions that I'm happy to take in a Q&A format and you don't have to stay and wait if you don't want to

[Applause] okay thank you Nathan um questions anyone got

questions assume you're handed a good reward model which is a large assumption I agree but what is the key challenge to doing online DPO in the sense that you can do n rollouts and then just rank them using a model and then you can iterate this so what is the hard thing yeah I'm going to repeat the questions so that people can hear them

and it gets recorded the idea is if you have a good reward model what is stopping you from doing online DPO and kind of just improving the policy from there I think there's kind of multiple

angles to this that are both technical and kind of industry-wide but the technical thing is I think the prompt matching

ends up being really important so prompt matching what your reward model can learn is specific to the prompts there's a technical detail where the prompts used for your policy often are exactly the same as your reward model's in PPO which is really strange because we talk about generalization in machine learning but we're kind of softballing ourselves at the PPO stage which is we're only grading PPO answers that our reward model is trained to grade which is kind of strange so people think that some of that might break down and we see

some of that when trying to train PPO models with off-the-shelf reward models that was kind of a long answer but I think it's mostly distribution matching if I had to guess but if we had a truly good model it should work for some things and that could be one of the reasons why there aren't that many in the open because it would kind of help people catch up in alignment a reward model if it is as important as people say it is might make it easy other questions yeah

[Music] yeah I think this is a whole conversation so if I don't cover it and you want more after I answer I can

you can come up but the question is is there more than pairwise preferences that could be used in RLHF and there are a lot of different lines of work studying this one is methods like this method out of Stanford called KTO named after Kahneman-Tversky I always mess it up these names are so hard to pronounce but it's the idea of using one-sided preference data so a lot of customer apps have did you get good support from this agent yes or no and you could use data like that it's just a different loss function for using a single side of

preferences or just yes or no there are other things like learning to rank for multiple answers so this is something I slightly insinuated but like binary

preferences there's a lot of literature on learning preferences and one of the models that came out of this is the Starling model and they use a k-wise preference so they have like five or nine answers to every prompt and then they collect answers and then they have a different loss function and this is one of the models that has kind of broken through in the open alignment space it's one of the few that I left in but skipped in my slide deck so that's kind of interesting and then there's other research on fine-grained preferences so for every

completion to a prompt you get labels like conciseness helpfulness honesty so there's a few things on that regards there's like a steer LM paper from

Nvidia and then there's work from udub that does like learning from fine G grained preferences so that one's probably like the one that's most emerging most in the academic sense but

there's so much to learn here there's like all like literally all the field of social Choice needs to get condensed into these things any other [Applause]
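For reference, the loss functions behind these preference formats are compact. A sketch in plain Python (real implementations operate on reward-model logits in PyTorch; the function names here are illustrative): the Bradley-Terry pairwise loss used in standard reward-model training, and the Plackett-Luce loss that generalizes it to k-wise rankings of the kind Starling uses.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    # Negative log-sigmoid of the reward margin: minimized by pushing
    # the chosen completion's reward above the rejected one's.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def plackett_luce_loss(rewards_best_to_worst):
    # Negative log-likelihood of an observed k-wise ranking: at each
    # step, a softmax over the completions not yet ranked.
    loss = 0.0
    rs = list(rewards_best_to_worst)
    for i in range(len(rs) - 1):
        denom = sum(math.exp(r) for r in rs[i:])
        loss -= math.log(math.exp(rs[i]) / denom)
    return loss

# With k = 2, the Plackett-Luce loss reduces to Bradley-Terry:
a, b = 1.3, -0.4
assert abs(plackett_luce_loss([a, b]) - bradley_terry_loss(a, b)) < 1e-9
```

The one-sided (KTO-style) and fine-grained settings swap in yet other objectives over the same reward scores, which is the sense in which these are all "just a different loss function."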

Yeah, so the question is: how can we exceed human performance with fine-tuning, or any training, for that matter? I think this is where some older ideas in CS will come back. One of the foundational ideas in CS is search, which is really also motivated as exploration in RL, and therefore we need language models that can search and generate new data. I was talking with a grad student before this, and I think search will be a large part of synthetic data, but the human aspect will be what gets it across the line when it can't solve a certain area. The Q* rumors are ridiculous, but that seems to be the best argument for the sort of thing OpenAI is trying there: how to get that barrier broken with AI.

Thank you so much for coming in. You mentioned datasets as a big limitation, and I was curious how one goes about creating a new dataset?

Yeah, this is another thing that's hard. I think community efforts are what people have tried to do. I mentioned Open Assistant, but most people that run a community effort say "I never want to do this again." While I still think it's worth doing things once that are highly impactful, even if you might not want to do them again, other avenues for building these datasets in a sustainable manner are very important. There are some ways this is being done: Chatbot Arena returns some of the prompts and labels to users. I have specific concerns with that data around it being too noisy, but that is the sort of thing that can happen. If AI2 has a demo for its models, it's going to be about science and generating information rather than being a ChatGPT competitor (it's a nonprofit; it can't do a product competitor), but that's the sort of data that we would want to release, and something that I might just have to do. I'm also interested in academic workshops and competitions as a ground where communities could meet every three, six, or eight months and have work focused on an area, and/or focused time for people to contribute. But it's a good question, and it's probably why there aren't very many.

How do you feel about reward models being subject to reward hacking as well? We'll get one at the front first.

Yeah, the one close by first, and then we'll come to you. Across the various places you've done research at over the years, do you have any sense of how they compare, specifically in terms of alignment research? Obviously they weren't all doing alignment research specifically at the time.

I think generally they represent different cultures and investments of the companies. I wasn't doing language models until my time at Hugging Face, so I can really only speak to these two open companies. From Hugging Face's perspective, the goal is to show that more people can do this: we're not trying to compete with ChatGPT, but we're trying to foster an ecosystem of doing this. AI2 is similar, but more about what is actually happening: how do we learn about this, how do we do the science of this, and how do we communicate it clearly. I'm sure if you do the exercise you can map this onto every company: what is their important thing, given their different goals in their products and their corporate structure and things like that. I will talk more when we're not recorded.
[Laughter]

Okay, up the back. Are reward models also subject to reward hacking, where they achieve a good result on the metric but in reality the outcome is not what was expected?

Yeah, when talking about reward models, this is probably the most established line of

work. The question is: are reward models subject to reward hacking? Reward hacking is a classic problem in RL. I should bring back the slide from my RL talks with the boat racing in circles, because this happens to your language model too. It does happen, and there's a lot of research to mitigate it, but it's a fundamental problem: you have a very powerful optimizer and an incomplete representation of your reward, and the optimizer will always find where your representation of the reward is wrong. So we will always be doing the best we can, but saying it's perfect is not possible in the math. I can also say that the ways it fails are pretty funny, because if you train these models you'll end up with a model that just says "JavaScript" as the answer to everything, on to infinity. Sometimes it's really easy to see when that is happening, which is good. Or you can change your loss function so that it will always exploit; that's a good way to make sure things are working, because you should be able to easily exploit the reward model if you turn the brakes off.
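The "brakes" here usually mean the KL penalty in the standard RLHF objective: the policy is optimized against the reward-model score minus a coefficient times its KL divergence from the reference (SFT) model, which limits how far the optimizer can wander while exploiting flaws in the reward model. A minimal sketch of that shaping, with an illustrative function name and a sampled KL estimate rather than any particular codebase's implementation:

```python
def penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    # Sampled KL estimate over the generated tokens: sum of
    # (log p_policy - log p_ref) at each token the policy produced.
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    # Reward-model score minus the KL penalty; beta sets the "brakes",
    # and beta = 0 removes them, making reward hacking easy to elicit.
    return rm_score - beta * kl

# A policy that has drifted to rate its own tokens much higher than the
# reference does pays a positive penalty:
shaped = penalized_reward(2.0, [-0.1, -0.2], [-0.5, -0.9], beta=0.1)
```

Deliberately setting beta to zero is one way to "turn the brakes off" and confirm that your setup can exploit the reward model at all, which is the debugging trick described above.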

Okay, any last public questions? If not, thank you to Nathan for giving this talk. If there's anything you'd like to ask off the record, he'll be here for a bit longer.
