Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 9 - Recap & Current Trends
By Stanford Online
Summary
Topics Covered
- Transformers Expand Beyond Text
- Diffusion Models: A New Paradigm for Text Generation
- Data Curation is Crucial for Future LLMs
- Hardware Innovation for LLM Efficiency
Full Transcript
Hello everyone and welcome to lecture 9 of CME 295. So as you know, today is a kind of a special day because we're having the last lecture of the entire course. So the
menu for today will be a little different compared to usual. We're going to try to divide the lecture in three parts. So in the first part, we're going to recap actually what we did in the entire class just to see how different pieces kind of fit together. In the second part,
we look at some topics that are particularly trending in 2025 and what we think are going to be trending in the near future. And then the third part will be more of a way for us to conclude and go over next steps for all of you. Does that sound good? Cool.
So with that, we're going to start with the first part, which as I mentioned is about recapping what we did this entire quarter. So nothing new here, it's just a way for us to piece everything together. So if
you remember, a lot of weeks ago, I believe maybe 10 weeks ago, we had lecture one, which was focused on understanding what transformers were. So at the very beginning of the class, we didn't even know how we could process text. So I guess the first step that we saw was this tokenization step, which consists of dividing the inputs into atomic units. And so here, the way we divide the text is something that is arbitrary in some sense. So we have different algorithms that allow us to do that. And we saw that the most common tokenization algorithm is the subword-level tokenizer. And we saw that some of the advantages
were that roots of words could be reused and leveraged, especially when it came to representing those tokens. And
speaking of representation, once we were able to divide the input text into atomic units, aka tokens, the next step for us was to learn how to represent these tokens as embeddings. So if you remember, we saw some methods that were very popular back then. So one of them was called
Word2Vec. And the representation was learned from a proxy task, which was something like predicting the center word or predicting the context words. But then we saw that this way of learning representations had some limitations. One of which was that these representations were not context aware, meaning that if a word appears in one sentence or in another sentence, that word will have the same
representation in both sentences. And so for that reason, we saw some other methods that were popular in the 2010s, one of which was RNNs, if you remember. So RNNs had this recurrent structure, which processed tokens one at a time, and kept an internal representation of the sequence
so far. But then we saw that a big limitation of this was this problem of long-range dependency, and in particular the fact that tokens that appeared far in the past could not really be retained as the sequence got longer. And this is the
reason why we saw the central idea of this whole class, which is the idea of self attention, where tokens can actually attend to one another regardless of where they are placed in the sequence. So you
can think of this as a direct link. And so this, for instance, is what we saw. We saw that there are like three main terminologies that people use. So query, key, and value. So typically you want to know how similar a query is compared to the keys in the
sequence. And you quantify that by taking some dot products that are kind of scaled and softmaxed, and then the corresponding value is taken. So at the end of the day, we obtain some kind of weighted average of all the tokens that are in the sequence.
And then you may also be familiar now with this formula: softmax(QK^T / sqrt(d_k)) V. So this is the matrix formulation of what I mentioned here, which is able to process these computations in a very efficient way, and it's something that today's hardware is well equipped to do.
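To make that formula concrete, here is a minimal NumPy sketch of scaled dot-product attention as just described. The shapes and variable names are illustrative choices, not anything from the lecture slides.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q, K: arrays of shape (seq_len, d_k); V: array of shape (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len): similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted average of the values

# Toy example: 4 tokens with 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```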
And then we finished the first lecture by going through the architecture that is the foundation of modern-day LLMs, which is the transformer. And we saw that there are two notable parts in the transformer: the encoder on the left part of the figure, and the decoder on the right part. And we saw how this was applied in the case of translation. So at the end of the first lecture, we saw what motivated us to end up with the transformer, and we saw that the transformer was working quite well in the case of translation. And so in the next lecture, what we saw were the little improvements that people have made to this architecture since it was released. And if you remember, it was in 2017 that it was published. So one particular improvement that
people have made is in the way we consider positions, because in the original transformer paper, positions were encoded in an absolute way, as in each position had its own embedding, and this embedding was added to the token embedding. But then if we think about it, we don't really care about the absolute position. We care about the relative position between tokens. And in particular, we care about how far apart tokens are in the self-attention computation, which is why we saw this
method that is now quite popular called rotary position embeddings, aka RoPE. It is a method that rotates queries and keys, both of which appear in the self-attention computation. And so here, what is quantified is purely a function of the relative distance between two tokens. And not only that, it is something that is taken care of in the self-attention layer, which is what we care about. So this was one big improvement.
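As a rough sketch of the rotation idea behind RoPE (not the exact implementation of any particular model), here is how one could rotate a query or key vector pairwise by position-dependent angles; the base of 10000 follows the common convention, and the final check illustrates the relative-distance property mentioned above.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate a query or key vector x (even dimension d) at position pos.

    Pairs of dimensions (2i, 2i+1) are rotated by an angle pos * base^(-2i/d),
    so the dot product between a rotated query and a rotated key depends only
    on the relative distance between their positions.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated = np.empty_like(x)
    rotated[..., 0::2] = x_even * cos - x_odd * sin
    rotated[..., 1::2] = x_even * sin + x_odd * cos
    return rotated

# Relative-position property: query at position 5 vs key at position 3
# gives the same score as query at position 12 vs key at position 10.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)
s2 = rope_rotate(q, 12) @ rope_rotate(k, 10)
print(np.isclose(s1, s2))  # True
```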
And then we saw some other improvements, especially when it came to how the multi-head attention layer was composed. In particular, we saw that it was possible for us to have some groupings of the matrices that we learn. So we don't need to have one projection matrix per head for, let's say, keys and values. We can actually group them. So this is, for instance, what is mentioned here: grouped-query attention. And then we also saw some other techniques that I have not represented here, like, for instance, the normalization layer in the transformer, which here happens after each sub-layer. But I guess nowadays people have tried moving the normalization piece before the sub-layer. So here it's the post-norm version, and the before-sub-layer variant is called the pre-norm version.
And then the last thing that we saw was that from this transformer architecture, there were a lot of derived models. So we saw that if we only keep the encoder part, we could compute very meaningful embeddings. If you remember, there was this kind of landmark paper on an encoder-only model, which is BERT, which was heavily used in the context of classification because it relied on the encoded embedding of the CLS token. So that was one. But then we also saw that there was
a number of other kinds of models, all more or less derived from the transformer.
So you could only keep the encoder, which was the case for BERT. You could only keep the decoder, which is, for instance, the case for GPT. And you could also have both, which is, for instance, the case of T5. And one particular aspect of each of these models is that an encoder-only model, in the way that we saw, is not able to generate text, but is able to generate embeddings, which can be used for downstream tasks. But then encoder-decoder models like T5 or decoder-only models like GPT, they can be autoregressive and generate text. The paradigm can be text in, text out. And
with that, we then focused on what now everyone calls large language models, which are transformer based models, specifically text to text models.
So decoder-only, transformer-based models. And we saw that people have come up with a lot of new tricks now because, you know, these models, as the name indicates, have been scaled up. But then one question was kind of thrown out, which is: do you actually need all these parameters just to do a forward pass? So we saw one kind of variant which was based on mixture of experts. So what mixture of experts is, is that instead of running everything through the entire model, you're going to instead have a number of experts that you're going to activate in a sparse way. So for instance, for one input, you're going to activate just a subset, and then for another input you're going to activate another subset, so that you don't need to do all the computations all the time. And we saw that this mixture of experts was used in LLMs in particular in the feed-forward neural network layer. So here you would have experts being different feed-forward neural networks, and you would have a gating mechanism that would route each token to the correct feed-forward network.
And then we also saw that some papers were able to produce some nice visualizations in terms of, I guess, which token gets routed to which experts. Because this routing, we saw that it was done at the token level. And one reason why it's done at the token level is to be able to, I guess, smartly put the experts on different pieces of hardware, different GPUs, and then parallelize the computation a little bit more.
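Here is a small NumPy sketch of that sparse routing idea: a learned gate scores the experts for each token, and only the top-k expert feed-forward networks are run for that token. The shapes and the top-k choice are illustrative assumptions, not the exact setup of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 64, 4, 2

# One "expert" = a small feed-forward network (two linear layers with a ReLU).
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1)
    for _ in range(n_experts)
]
W_gate = rng.normal(size=(d_model, n_experts)) * 0.1  # router / gating weights

def moe_layer(tokens):
    """Route each token to its top-k experts and mix their outputs by gate weight."""
    outputs = np.zeros_like(tokens)
    logits = tokens @ W_gate                               # (n_tokens, n_experts)
    for t, x in enumerate(tokens):
        chosen = np.argsort(logits[t])[-top_k:]            # indices of the top-k experts for this token
        gate = np.exp(logits[t, chosen])
        gate /= gate.sum()                                 # renormalized gate weights
        for g, e in zip(gate, chosen):
            W1, W2 = experts[e]
            outputs[t] += g * (np.maximum(x @ W1, 0) @ W2)  # only the chosen experts do any work
    return outputs

tokens = rng.normal(size=(5, d_model))   # 5 tokens in a sequence
print(moe_layer(tokens).shape)           # (5, 16)
```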
And then we also saw that these LLMs are always tasked with predicting the next token. And in order to predict the next token, we were interested in, I guess, how we were doing this. And so one particular method that people use is to just sample from the output distribution. So given an input, you have a distribution of probabilities over what the next token would be, which is output by the model. And what you do is, instead of taking the highest probability, which is called greedy decoding, you actually sample. So it introduces some randomness and allows the model to produce a bigger variety of outputs. And we saw that you could adjust how much variety you want in your outputs by tweaking a hyperparameter called temperature. So a very low temperature leads to very spiky distributions, so more deterministic outputs. And higher temperatures are, I guess, a bit more random, a bit more creative.
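A minimal sketch of temperature sampling over a toy next-token distribution; the vocabulary size and logits here are made up purely for illustration.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token id from softmax(logits / temperature).

    Temperature close to 0 approaches greedy decoding (argmax);
    higher temperature flattens the distribution and adds randomness.
    """
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)
    scaled -= scaled.max()                 # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy logits over a 4-token vocabulary
rng = np.random.default_rng(0)
for T in (0.1, 1.0, 2.0):
    samples = [sample_next_token(logits, T, rng) for _ in range(1000)]
    print(T, np.bincount(samples, minlength=4) / 1000)  # low T concentrates on token 0
```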
Okay, so until then, we saw what LLMs were, how they were based on the transformer, how they connected to the architecture that we saw in the first lecture. And then in lecture four, we saw how people actually trained those LLMs. Because as I mentioned, these LLMs are large, and so you cannot kind of naively fit them in your hardware. You need to be a little
bit smart about it. So in particular, what people have kind of noticed in the early 2020s is that the bigger your model is, the better your performance. So people just started building bigger and bigger models. So here in the illustration, we saw that the y-axis is the test loss, so the lower, the better. And we saw that the more compute you use, the better your test performance, and same with increasing the dataset size, and same with increasing the number of parameters. But then, as you know, compute is not infinite. So there was a natural question that came out of the community, which was: okay, if we give you a given budget, a given compute
budget, can you choose, I guess, some quote-unquote optimal number of parameters and dataset size on which you want to train your model? And so we saw that there was this paper that was published in the early 2020s,
which actually studied the relationship between the dataset size, the size of your model, and the performance on the test set. And then we saw that actually most models at the time were what we call undertrained, because they were too big compared to the dataset that they were trained on; the dataset was not as big as it should have been. And so in particular, there was kind of a rule of thumb that came out of this, which was: if you have a given number of parameters in your model, you should train it on at least 20 times that number of parameters in terms of tokens. So for instance, if you have a 100 billion parameter model, you should train it on at least 2 trillion tokens, because 2 trillion is 100 billion times 20. So that's kind of the rule of thumb that people have used.
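As a quick sanity check of that rule of thumb (roughly 20 training tokens per parameter), a couple of lines of Python:

```python
# Rule-of-thumb compute-optimal token count: ~20 tokens per parameter.
def chinchilla_tokens(n_params, ratio=20):
    return n_params * ratio

print(f"{chinchilla_tokens(100e9):.2e}")  # 100B-parameter model -> 2.00e+12 tokens (2 trillion)
```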
And then, as I mentioned previously, these models are huge. So people have also tried to make the computation more efficient. And so there was this method that we saw, which is actually quite important: FlashAttention. And FlashAttention is a method that leverages the strengths of the underlying hardware. And in
particular, it looks at GPUs more specifically, at the kinds of memories that a GPU has. So it has a big but slow memory and a small but fast memory, the HBM and the SRAM respectively. And we saw that this method tries to minimize the number of reads and writes to the big and slow memory, to the HBM. And so the way it was doing this was to divide the computation into little blocks that it would send to SRAM, which is the small but fast memory, so that it can do the computation there and then send the results back, in order to do the full end-to-end computation. So that method is an exact method, meaning that we're not doing any approximations to the results, but it led to significant speedups. And in particular, there was this second idea from the paper, which is kind of an important one as well, which was that sometimes it's okay for you to not store results. It's okay for you to just throw them out and then recompute them when you need them again. So there is this idea of recomputation using what I described, which led to faster runtimes even though we were doing more computations.
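The actual FlashAttention kernel is fused GPU code, but the core numerical trick, processing keys and values block by block while keeping a running softmax so the full attention matrix is never materialized, can be sketched in NumPy for a single query, just to show that the result stays exact:

```python
import numpy as np

def attention_one_query_blocked(q, K, V, block=4):
    """Exact attention output for one query, computed block by block over K/V.

    Keeps a running max (m), running softmax denominator (l) and running
    weighted sum (acc), so only one block of scores is held at a time,
    mirroring the tiling idea behind FlashAttention.
    """
    d = q.shape[-1]
    m, l = -np.inf, 0.0
    acc = np.zeros_like(V[0], dtype=float)
    for start in range(0, K.shape[0], block):
        s = q @ K[start:start + block].T / np.sqrt(d)   # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                       # rescale previously accumulated results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
scores = q @ K.T / np.sqrt(8)
full = np.exp(scores - scores.max())
full = (full / full.sum()) @ V
print(np.allclose(attention_one_query_blocked(q, K, V), full))  # True: the tiling is exact
```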
So that was flash attention. And we also saw a number of other methods that were meant to, I guess, parallelize the computation. So we saw data parallelism, which was this idea of not having all your data be processed on a single GPU, but instead divided into multiple
places. And then we had the second method, which was model parallelism, where even for a given forward pass, you would actually involve multiple GPUs. So anyway, there were a lot of very interesting techniques, a lot of different ideas about how to train
this model in an efficient way. In particular,
so what I described here is mostly important for the first step of the training process of an LLM, which is called pre-training, which is meant to teach the model about the structure of language, about the structure of code. And in particular, this model is trained with huge amounts of data, so think about trillions of tokens, or even tens of trillions of tokens. And so that first step goes from an initialized model to a model that is able to autocomplete, because it is trained with an objective of predicting the next token. So at the end of this first stage, you have a model that knows how to autocomplete, but it is not very helpful because it only knows how to complete things.
So in order to have the model be useful for actual use cases, we had this second step, which is called the fine-tuning step, where we train the model on the kinds of input-output pairs that we want it to perform well on. So this is also called the SFT stage, the supervised fine-tuning stage. And at the end of the second step, we have a model that not only knows the structure of text and code, but also is able to behave in the way you want. But so far, up until step number two, we have only taught our model what to do. We have not taught it what not to do. And this is why we had our third step, which was the preference tuning step, where we took our model that went through the pre-training stage, that went through the SFT stage, and now we want to inject some negative signal as well, as in: I want you to prefer this output compared to that output. And this third step uses preference data, like the name indicates, which is typically pairwise data where humans say, okay, I prefer this output compared to that output.
And typically the model here is able to align the kind of output it produces with human preferences that could be along the dimension of usefulness, safety, friendliness, tone. There's a bunch of different dimensions.
So that's what is happening in this third step. And it's actually in lecture five that we dug into what that third step was about.
So if you remember, we had drawn a parallel between the way our LLM produces tokens and, I guess, the way people in the reinforcement learning field think about a given policy interacting with some environment, performing some actions
and being in some states. And the reason why we drew that parallel was to be able to leverage some RL-based techniques in order to train our model. So in this case, we said our LLM is a little bit like a policy. So given some state, which is the input it has received so far, it can perform the next action, which in this case is to predict the next token. And this prediction is made in the environment of tokens. And when
we predict a completion, what we do is at the end of the day, we have some signals, some rewards, which can be the human preference.
So this is the parallel we drew with the RL world. And with that in mind, we talked about rewards. But the problem is that rewards are only available for a limited set of data, which is why we saw how to model rewards. So we saw this formula. If you remember, it's called the Bradley-Terry formulation, which models the probability of an output being better than another one as a function of, I guess, two scores, the score of output i and the score of output j: P(i is preferred over j) = exp(r_i) / (exp(r_i) + exp(r_j)), which is just sigmoid(r_i - r_j). And we saw that reward models are typically trained with this formulation in mind, in a pairwise fashion. So what this means is: you give a reward model two outputs, you say this one is good, this one is bad, and then you want it to score the good one higher. You train it in a pairwise fashion, but the model itself really predicts one score per output, r_i for output i and r_j for output j, and so at inference time, you're only giving it one output. So I think that's like one subtlety: we train it in a pairwise way, but at inference time, we're kind of using it in an individual way, if that makes sense.
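Here is a tiny sketch of that pairwise training objective, with a toy linear "reward model" standing in for the real network; the point is just that the loss, -log sigmoid(r_chosen - r_rejected), is computed on pairs, while the model itself maps a single (prompt, output) representation to a single score. The feature vectors and learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w = rng.normal(size=d) * 0.01          # toy linear reward model: r(x) = w . x

def reward(x):
    """Score a single (prompt, output) feature vector -- what we use at inference."""
    return w @ x

def pairwise_loss_and_grad(x_chosen, x_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    diff = reward(x_chosen) - reward(x_rejected)
    sig = 1.0 / (1.0 + np.exp(-diff))
    loss = -np.log(sig)
    grad_w = -(1.0 - sig) * (x_chosen - x_rejected)   # gradient of the loss with respect to w
    return loss, grad_w

# One gradient step on a made-up preference pair.
x_good, x_bad = rng.normal(size=d), rng.normal(size=d)
loss, grad = pairwise_loss_and_grad(x_good, x_bad)
w -= 0.1 * grad
print(loss, reward(x_good) > reward(x_bad))  # after the update, the chosen output should score higher
```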
And so once we trained our reward model using this formulation, then we were able to use it to steer our LLM in the direction that we care about. So if you remember, the way we steer our LLM in the direction of human preferences is to give it a prompt so that it can
produce a completion, AKA a rollout, or in simpler terms, an answer. And then we take this prompt, we take this answer, we put them both in the reward model that tells us how good the model response is. And depending on what the reward model says, we can
tune the weights of the LLM in a way that maximizes the reward that we saw, which is trained on human preferences. And
the loss function of this RL setup is typically something that tries to maximize rewards, but
also keep the model close to the base model. And here by base model, we mean the SFT model. And the reason why we want that is because the reward is imperfect. So we saw this phenomenon of reward hacking, where your reward can be imperfect and the LLM can exploit that imperfect nature in a way that actually does not align with what you want. So you want the LLM to not be too
far from the base model, which is actually already a good model. So it's a way to regularize that if you want. And you also want the iteration updates to not be too big either. So you typically have these two constraints, you don't want it to deviate too much from the base model,
but you don't want it to deviate too much from the previous RL iteration. And
then just as a reminder, I think this was lecture five, which I think was the most technically challenging of the whole class. So it's completely fine if the first time you were like, you know, what's happening? But hopefully now it should be a little bit more clear. Cool, and then after lecture five, we were like, okay,
we've done a lot of hard work. So the good thing is, you know, we're in 2025 and in the past 12 months or now 14 months, we've seen a lot of models that were being released with these reasoning capabilities.
And the way they were trained to exhibit these advanced reasoning capabilities was actually leveraging a lot of the techniques that we saw in lecture five.
Just like RL-based techniques. And in particular, what we want our LLM to do is to output a reasoning chain before producing the final answer. And the reason why we wanted to do that is because people have seen that it improves the performance of the model. And so it's actually relying on this idea of chain of thought, which I believe we saw in lecture three, which is a prompting technique to have your model output the reasoning before outputting the response. So long story short, up until lecture six, our LLM was taking a prompt as input and directly outputting the answer. But in lecture seven, we said, sorry, in lecture six, we said: well, let's have our LLM actually first output a reasoning chain, which the user may or may not have access to, before outputting the final answer. So
you want to teach the LLM to do that, so how do you do that?
Well, first before doing this, I just want to show you this chart which we saw, which is the performance of the model as we're teaching it to produce these reasoning chains. So people have typically measured the improvement in performance by comparing it to,
I guess, certain benchmarks. And this one is a popular one, the AIME benchmark, which is a math benchmark. And we saw that as the training progresses, the accuracy of, I guess, what the LLM outputs is increasing. But back to what I was saying, the key
technique that we use to teach the model how to output these reasoning chains is leveraging the RL techniques that we saw in lecture 5. And in particular, up until now we saw PPO, which was the main RL algorithm that people
were using up to maybe last year. And now people are kind of prioritizing GRPO as the RL algorithm in order to teach the model to be better at reasoning tasks. And there are several reasons for that, which I will make explicit right now. So we saw this illustration that compared how GRPO differs from PPO.
And as you can see in the graph, there are a few things that are different. The first thing is that GRPO does not rely on a value model. So who remembers what a value model is?
Yes, exactly. So the value function is trying to predict what the reward would be if you were to follow the policy of the LLM.
And I guess it's a way to have some baseline as to how good some predictions are. You want to make it more relative.
So the value function is a way for us to make these rewards a little bit more relative to one another. And so that's how PPO was doing this: it was having a value model making these predictions. And
then we had this generalized advantage estimation method that was combining the reward predictions with the value function predictions in order to obtain what we call advantages. So an advantage is how good your output is compared to some baseline.
But then, in contrast to that, GRPO said: okay, we don't need a value function because it's too expensive to train and maintain. What we're going to do instead is generate several completions and then have some formula that compares the rewards of these completions to one another. So it's going to have some relative effect, in the sense that it will make things more relative. And in doing so, you no longer need to maintain and train a value function. And that's like one big difference compared to PPO.
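A minimal sketch of that group-relative part: sample a group of completions for the same prompt, score them (here with a verifiable 0/1 reward), and normalize the rewards within the group; these normalized scores play the role of the advantages, with no value model involved. The exact GRPO loss (clipping, KL term to the reference model) is omitted here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group of completions."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# For one math prompt, suppose we sampled 6 completions and checked them
# against the known answer (verifiable reward: 1 if correct, 0 otherwise).
rewards = [1, 0, 0, 1, 0, 0]
print(group_relative_advantages(rewards))
# Correct completions get a positive advantage, incorrect ones a negative one,
# without ever training a separate value model.
```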
And the second big difference, which is not represented in this illustration, is that GRPO is typically an algorithm that people have used in the context of teaching your model to be better at reasoning tasks.
And so we saw that these kinds of problems have a verifiable reward, because when you complete a math problem, you actually know the answer you need to get to. So you don't need to train a reward model to tell you how good your final answer is, because you already know the answer. And so we saw that GRPO was in particular
used in the context of when you actually don't even need a reward model, when you actually have a verifiable reward. So at the end of the day, the only two models you need to keep are the policy model and the reference model, to be able to just compare how far you are from
the reference model. Cool.
I know this one was also a challenging class. I guess so far so good.
And this is also on the final, which is why I'm taking things more slowly for this second part of the recap. So is everything good so far? Yeah? Okay, perfect. We also saw some extensions of GRPO. So if you remember, there was some kind of bias that was a result of the loss function of GRPO having a normalization term that penalized tokens that were in shorter outputs.
So we saw that if you use GRPO in its original form, after a certain point the algorithm will incentivize your model to produce longer and longer incorrect answers. And the reason why it does that is because, relative to short incorrect answers, it penalizes long incorrect answers less. And so this is the reason why there are some extensions that people have worked on this year, one of which was GRPO done right, where we saw that they basically removed the normalization term. And there was another method that we saw, called DAPO, D-A-P-O, which also had some variants. And that's for reasoning models.
And then in lecture seven, we had a model that, you know, we knew how to train, we knew how to use for reasoning tasks and how to train to be better at them, but now we wanted the model to be useful and to interact with outside systems. So we saw one technique that is kind of essential, called RAG, short for Retrieval Augmented Generation, which is meant for you to be able to fetch relevant documents from some knowledge base in order to answer a question or a prompt. And the reason why you want to do that is that the knowledge of your LLM only goes up to the knowledge cutoff date, which is the max date of what your LLM has been trained on. And from a practical standpoint,
I guess from what we see nowadays, you're typically not training your LLM daily or continuously. And so, in cases where you need your LLM to know about things that happened recently or about things that were not in its training data, you want your LLM to have access to such information. And so that's where RAG is very useful. So we saw that RAG depends very heavily on the way it retrieves data.
So we saw that the retrieval part was mainly composed of two steps. So the
first one was candidate retrieval, which uses a bi-encoder kind of setup, where you're basically doing some semantic search. So you're computing the embedding of the query, you have some pre-computed embeddings of the documents in your knowledge base, and you're taking the ones
that maximize some similarity score, like let's say some cosine similarity.
So this first step allows you to retrieve, I guess, a filtered version of the potential documents. And then typically you have a second step which is called ranking, or re-ranking because the first step already gives you a ranking, and which typically has a more sophisticated setup. So it's a cross-encoder kind of setup, where your query and your document are both fed to some model that produces a more precise score. And then you use this final score to rank the final results, and you typically choose the top, let's say, K. And then you add them to your prompt, which is the augmented part. So retrieval is everything I mentioned so far: once you have the relevant documents, you add them to your prompt, which is the augmented part, and you generate the answer.
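Here is a rough sketch of that two-stage retrieval. Both `embed` (the bi-encoder) and `cross_encoder_score` (the re-ranker) are hypothetical stand-ins for whatever models you actually use; only the overall shape of the pipeline is the point.

```python
import numpy as np

def embed(text):
    """Hypothetical bi-encoder: maps a text to a unit vector (stand-in implementation)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def cross_encoder_score(query, doc):
    """Hypothetical cross-encoder: jointly scores (query, doc); here just a placeholder."""
    return float(embed(query) @ embed(doc))

def retrieve(query, docs, n_candidates=20, top_k=3):
    # Step 1: candidate retrieval by cosine similarity against pre-computed document embeddings.
    doc_embs = np.stack([embed(d) for d in docs])        # in practice, computed offline and indexed
    sims = doc_embs @ embed(query)
    candidates = [docs[i] for i in np.argsort(sims)[::-1][:n_candidates]]
    # Step 2: re-rank the candidates with the more expensive cross-encoder and keep the top K.
    reranked = sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)
    return reranked[:top_k]

docs = [f"document {i} about topic {i % 5}" for i in range(100)]
context = retrieve("tell me about topic 3", docs)
# The "augmented" part: the retrieved documents are prepended to the prompt before generation.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: tell me about topic 3"
```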
So the reason why I'm taking so much time on RAG is that RAG is such an important concept. Also useful if you were to have interviews, or maybe in the exam, who knows. So I think it's an important concept to have in mind. The second technique that we saw was tool calling.
And tool calling allows your LLM to leverage tools. The way it does that is in two steps. The first step is for your model to know which APIs are out there, at the end of which your LLM says: okay, I want to use this API and I want to use it with these arguments. Then you have an intermediary step, which is that you just run the API with these arguments. And then the second step is that you feed the results of this operation back to the LLM, which then produces a final answer. So that's how tool calling works. So if you say to your LLM, okay, you can use this API, this is how your LLM would leverage it. And then we saw that modern-day agentic workflows were leveraging both RAG and tool calling as key methods to perform actions.
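A bare-bones sketch of that two-step flow, where `llm` is a hypothetical function standing in for your model, `get_weather` is a toy tool, and the tool registry is just a dict of Python functions:

```python
import json

def get_weather(city: str) -> str:
    """A toy tool the LLM is allowed to call."""
    return f"It is 18 degrees and sunny in {city}."

TOOLS = {"get_weather": get_weather}

def llm(messages):
    """Hypothetical model call. Step 1: it may answer with a tool request as JSON."""
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "get_weather", "arguments": {"city": "Paris"}})
    # Step 2: once tool results are in the context, it produces the final answer.
    return "Based on the tool result, the weather in Paris is 18 degrees and sunny."

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
call = json.loads(llm(messages))                           # the model asked for a tool
result = TOOLS[call["tool"]](**call["arguments"])          # intermediary step: run the API with those arguments
messages += [{"role": "tool", "content": result}]          # feed the result back to the LLM
print(llm(messages))                                       # final answer
```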
And we saw an example, a detailed example, which was such that you had some inputs, and then your LLM had a series of different calls in order to perform some action, and then at the end of it, it retrieves, sorry, it returns an answer. Cool, and then
last lecture we saw how we could evaluate LLMs, which is a much tougher thing to do now that LLMs can do a bunch of different things. So we first saw that there were some rule-based metrics that people were using before LLMs came into play, metrics that you may have heard of like BLEU, ROUGE, METEOR and so on. But the main limitation was that they were not considering how language could differ but still be correct.
And so the key idea that we saw was: why not leverage LLMs to evaluate outputs. And so there is this key idea of LLM-as-a-judge, where the judge receives as input the prompt, the model response, along with the criteria that you want the response to be evaluated on. And then you want your LLM judge to output two things. The first one is a rationale for why a given score is output, along with that score. So nowadays, LLM judges typically output a binary response, either pass or fail, true or false, just because it's easier. And we also have the rationale be output before the score, because in practice it's something that also improves the performance of the LLM judge.
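A sketch of what such a judge call could look like. The prompt template, `call_llm` and the JSON output format are all hypothetical illustrations of the rationale-then-verdict pattern just described, not any specific API.

```python
import json

JUDGE_PROMPT = """You are an evaluator.
Criteria: the response must be factually accurate and directly answer the question.

Question: {question}
Response to evaluate: {response}

First write a short rationale, then give a verdict.
Reply as JSON: {{"rationale": "...", "verdict": "pass" or "fail"}}"""

def call_llm(prompt: str) -> str:
    """Hypothetical judge model call; a real implementation would hit an LLM API."""
    return json.dumps({"rationale": "The response answers the question correctly.",
                       "verdict": "pass"})

def judge(question, response):
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    parsed = json.loads(raw)                      # rationale first, then the binary verdict
    return parsed["verdict"] == "pass", parsed["rationale"]

ok, why = judge("What is 2 + 2?", "2 + 2 = 4")
print(ok, why)
```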
A little bit, if you want, like reasoning models do, by outputting the reasoning chain before they output the answer. But then we also saw that there were some biases that came with this approach. We saw position bias, which is that the way you present the elements to compare matters. So if you present something first, then maybe the LLM will just prioritize that first. So there was position bias, there was verbosity bias, which is your LLM just preferring longer outputs, and self-enhancement bias was another one, where it prefers its own outputs.
And then we also saw a number of benchmarks that people use nowadays in order to say how great their LLM is. So if you see the releases that come out, there are typically a bunch of metrics across a number of different benchmarks that people know about. So that spans knowledge, the ability to reason, coding, which is very important because a lot of applications are coding related, and then safety. And this is not an exhaustive list, so there are actually many more dimensions.
So yeah, I think that's where we stopped. It was last lecture.
And this is all you are expected to know for the final.
Everything after that is not going to be part of the final.
Any questions on this so far?
Cool. OK, I'm expecting 100s from everyone on the final. But
yeah, I would say what I went through is going to be foundational for the final. So I guess if you understood everything I said, I think you're going to be ready for the final. So
yeah, but if you have any questions, you know, Shervin and I are always here to, oh yeah, you have a question?
Yes, so the question is: is the scope for the final lecture five to lecture eight? Yes. So for the midterm it was lectures 1, 2, 3, 4, and this one is 5, 6, 7, 8, so I guess it's equal size.
Cool? Great. So with that said, we just finished recapping this entire quarter's worth of lectures, and now we're going to go to the second item of today's menu, which is looking at some trending topics. So I'm going to start with the first one.
And I'm going to introduce it as follows. So if you remember, we saw that the transformer was a concept and an architecture that was first introduced in the context of machine translation. So it performed great.
People said, OK, it performs great on machine translation. Why not try it on other text tasks? They tried it, it performed great. But now the question is: can you not use it for things other than text? It's a natural question, right? So in order to answer that question, I just want us to remind ourselves that this architecture is relying on this concept of self-attention. And this is what is making the transformer work so well. So if we just recap what self-attention is, this illustration kind of does the job quite well.
You have a query and then you have a bunch of other elements which are represented by your keys and your values. And you want to know which other elements are actually relevant in order to compute the embedding for that query.
So right now we have only used tokens, text tokens. But
text tokens, they're actually vectors.
So if you take those vectors and you actually represent something other than text, like for instance parts of an image, the question is: would the transformer, based on that kind of input, also perform well?
And so here, the key question that I want to ask is how can we adapt our transformer to work on non-text input?
And for instance, here we can think of image understanding input. So you have some image, and it's a traditional computer vision task where you want to know which class this image belongs to. So you want to know if having some transformer-based architecture would work well in that situation.
Well, the answer to that is: first, in order to adapt it to this task, you would take the encoder part of the transformer, because in order to understand what is in an image, you need to classify that image in some sense. So if you remember, if there is one model among what we saw that was working very well for classification, it was BERT. Because BERT is encoder-only, it computes meaningful embeddings that can then be used for projection purposes or for classification purposes. So it's a very natural choice here. So here we would just keep the encoder part of the transformer and then have the self-attention mechanism come into play and compute meaningful embeddings that we could then project for our relevant task. And this is exactly what a group of researchers did back in 2020. So have you heard of ViT? The vision transformer? Yeah? No? Yeah? So what I described here is exactly what they did. So they took an image, they divided that image into patches. Those patches were represented by some vectors, and of course you have some kind of position information that allows you to know where your patch is in the image. And then you just put that through the transformer encoder, so the encoder part of the transformer, and you compute the representation corresponding to the CLS token, very similar to BERT. And you would just project that representation over some classes of interest,
And then you would perform your computation like this. So what that paper found was that if you train such a model on a lot of image data, you then outperform these traditional convolutional neural network kinds of methods. And so it was kind of remarkable. So why is it remarkable? Because in the vision case, there is this concept of inductive bias, where you want to gear your model towards looking at certain things in order to deduce the result. So convolutional neural networks are a kind of model that is designed in a way for you to look at the image in a sliding way. You look at your image a little bit like you would look at it in practice as a human. And people had hypothesized that such a bias, such an inductive bias, would actually make sense for something like a vision task. So you contrast that with the vision transformer, which is actually letting all parts of the image attend to one another, and which has, on the other side, very low inductive bias. So what this paper showed was that if you give your model enough data, then it will actually learn how to classify, I guess, your images into these classes. So this was kind of a remarkable result, and a nice extension of everything we saw. So with that in mind, I want us to just go through an end-to-end example of how you would process an image and go through that
ViT, so the vision transformer, in order to make your prediction. So here you would take your favorite image, which you would just split into patches. So here you can think of pre-defining some fixed-size patches, let's say a three-by-three grid, where each patch has some fixed number of pixels. And then what you do is, for each patch, you try to have some vector representation. So what is a patch? It's composed of pixels. So if each pixel has three values, which correspond to red, green and blue, then you can find a way to project those onto some lower-dimensional flattened space. And you can learn how you would project that through some kind of linear layer. So long story short, you just find a way to associate a vector with each of these patches, with which you then represent every single one of your inputs. And then you have, of course, a special embedding for the CLS token, which you can also learn. And you add the position embeddings. So you do the same for all of your inputs.
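Here is a small NumPy sketch of that patchify-and-embed step. The image size, patch size, embedding dimension and the random "learned" projection are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 96; P = 32; C = 3; d_model = 64           # 96x96 RGB image, 32x32 patches -> 3x3 grid
image = rng.random((H, W, C))

# 1) Split the image into non-overlapping patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # (9, 3072): 9 patches, 32*32*3 pixel values each

# 2) Linearly project each flattened patch to the model dimension (this projection is learned in practice).
W_proj = rng.normal(size=(P * P * C, d_model)) * 0.02
patch_embeddings = patches @ W_proj               # (9, 64)

# 3) Prepend a (learnable) CLS embedding and add (learnable) position embeddings.
cls_embedding = rng.normal(size=(1, d_model)) * 0.02
pos_embeddings = rng.normal(size=(1 + patches.shape[0], d_model)) * 0.02
encoder_input = np.concatenate([cls_embedding, patch_embeddings], axis=0) + pos_embeddings
print(encoder_input.shape)                        # (10, 64): ready for the transformer encoder
```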
And then, very similar to BERT, you just put that through your encoder and let everything interact with everything. And then at the end of the day, what you care about is a representation of the input that is meaningful. So typically people take the encoded embedding of the CLS token. The reason why they take that is, one, because it's a convention, but the second one is that this CLS token, so its encoded embedding, is actually an embedding that has interacted with all other tokens through this self-attention mechanism. So it has seen everything. And then you would project that CLS token encoded embedding onto some classes through a feed-forward neural network in order to predict your final class. In this case, we know it's a picture of a teddy bear, so here we would want the model to
classify this as a teddy bear. So far so good?
Does that make sense? Cool. So now,
okay, we know how to process image input right now.
So another question is: how would you have your LLM answer questions about your image? Which is something that you can actually do nowadays. Like if you open ChatGPT, you can input an image and ask it questions. So
you would have two kinds of inputs. So you would have an image, which we saw we can find a way to represent. And then the text, which you now know very well how to represent, like with tokens. So the way you would allow the model or let the model process all
of this is typically as follows. So there are like a couple of methods. The
first one is the most common one, which is you just feed everything as input.
So the image tokens as input, the text tokens as input, and you have some representation to have the model just know that these are image tokens and these are text tokens. And then you let it generate an answer in a decoder-only, autoregressive fashion, exactly like you would normally. That's the first method, and a lot of such models are designed that way. So there is, for instance, a very popular open-weight, I believe, vision language model, VLM, that's what they're called, a model called LLaVA. And this is how they do it: they have some encoder on the image part that produces some tokens, which are then concatenated with the text tokens that are input into the LLM. So that's one method.
The second method, which is less common, is to have the images be input at the cross-attention layer. So
here what you would do is: you have your text input and then your image input, but you don't put the image in the input itself, you actually let it interact with the text tokens within the cross-attention layer. And this is something that, for instance, Llama 3 had represented in their paper.
This technique is typically less common. The first one is more common.
So I guess what I want to say is that CME 295 focused on the transformer specifically in the case of text-to-text problems. So text generation is, you know, what this class was all about. But we also have the transformer being used for non-text applications. So here we saw image understanding, so vision understanding with ViT. And then we will not have the time to cover this now, but also for image generation tasks, we can have parts of the transformer be used in that architecture. And so you may hear about the diffusion transformer or the multimodal diffusion transformer, which actually rely on the self-attention mechanism. This is actually not an exhaustive list. This is also something that has been used in other domains like recommendations, speech, and so on. So I guess what I want you to remember from this is that the transformer was an architecture that performed very well for machine translation tasks, but then it proved to perform very well for other text-related tasks. And it was then reused in a bunch of other domains, which also proved to be quite successful. So I would just encourage you after this class to also keep an open mind for non-text-related transformer applications. And the ones that I mentioned here are maybe just the first few pointers into the kind of papers that you can look at. Cool. So here
we said that Transformer was something that came from the text world that's also useful and used in other worlds. Now I want to tell you about something else that was, I guess, used and useful in the non-text world that may be useful in the text world. And I want to tell you about
diffusion-based LLMs. So who has heard the term diffusion?
Who knows about diffusion? Yeah? Cool. So we will see how we can apply that to LLMs. So this is a very trendy topic. I believe the first paper started in the early 2020s, but I guess it's only now that people are starting to have this really work. I
just want to start with a motivation, which is that up until now, we have taken for granted the fact that our LLM is an autoregressive LLM.
And by autoregressive, what do I mean by that? So it takes some inputs, and what the LLM tries to do is to predict the next token. So
given everything so far, we predict the next token. We do that, and then we take that token that we just predicted, along with everything that we have predicted so far, and then we again predict the next token. We predict it and we go again and again, up until finishing the sequence with that end-of-sequence token, which makes the generation stop. So this is true autoregressive generation, as in: we take the input so far in order to predict the next token, and then we repeat this process until the end. So it's something that people now try to give a name, which is autoregressive model. So ARM, if you see this notation, that's what it means. The problem with that kind of paradigm is that inference-time generation is actually not something you can parallelize, because you always need what comes before in order to predict the next one. But I just want to say that inference-time generation is not parallelizable, but training is parallelizable. So if you remember, the way we do training is we input all the tokens that we want our model to predict, and then we let the model generate tokens out of this. So basically, in a decoder-only setting, you have this causal mask which lets your model not cheat, if you want, and not use the future tokens. So when I say that this paradigm is not parallelizable, I just want to emphasize that it's inference time that I'm talking about. Training time, you can actually parallelize quite well. So as I mentioned, that's one of the reasons why people have tried to look at other paradigms. And in particular, if you know about diffusion, you know that it works very well for the vision domain. And so people have tried adapting this
paradigm for the text generation case. And so this is a bunch of screenshots we took from announcements that happened this year. So for instance, earlier this year, there was an experimental text diffusion model from Google that they presented during the I/O event, which was very impressive because it led to a lot of speedups.
And then we have some different startups. So Inception is one of them that made headlines, I believe a couple of weeks ago or last week. No, sorry, a month ago that also are pursuing this route. So all of that to say that this direction is a very trendy and hot direction that
potentially has a lot of promise. But the key issue with this is that text is discrete, whereas images are continuous. And we're going to see why that distinction that I just made matters. So I'm going to try to explain to you
what diffusion is in two minutes. So in the image world, in order to generate an image, what people typically do is they start from noise and then they try to generate some image. Now you may wonder, okay, why noise? Well,
you cannot do something that's, you know, autoregressive because I guess like if you were to say, okay, let's predict the pixels one at a time, it's just not tractable because there are many pixels in an image and this is typically not how you produce an image. But some other reasons are that
Noise is just something that you can model very well with some very popular distribution.
So, Gaussian distribution, if you know about it, it has very nice properties.
Also, noise is very easy to sample, so it's very easy to start with that.
Noise is also a way for you to introduce randomness, because you don't necessarily want to always produce the same image. You want to have the choice of generating images that are slightly different from one another, and mathematically it just works quite well. And speaking of that, the goal is to learn some transformation that would allow you to go from noise to the target image distribution. So all of what I said here is just kind of a set of reasons for me to tell you: okay, noise is actually a choice that is quite
natural to start with in order to generate an image. And
to give you an analogy, let's suppose you're a sculptor, so the person who does sculptures. So if you want to do a sculpture, you typically start with some rock. But then rocks are different from one another. They always have things that are unique to them. And
still you would focus on what to remove in order to obtain the end sculpture. So you can think of the rock as being your noise and your end result as being your target data distribution. So I just want to share this quote by Michelangelo: "The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material." The reason why I'm reading that is that you can draw a nice analogy between what Michelangelo said and the process of denoising the noise to get an image. So this is all just motivating how image generation is done. So you start from noise and you want to generate an image. So the
way you do that for diffusion is to learn some transformation that would allow you to go from noise to image.
So you have two steps. I mean, you have more than two steps, but we have these two main steps for diffusion models, where you first start from clean images and you add noise gradually until you obtain some very noisy image, and then from that, what diffusion models try to do is to predict the noise to remove in order to obtain the image. So it's a little bit like you're the sculptor and you have your rock: you just want to learn what pieces of rock you need to remove in order to obtain your final piece of art. So this is what diffusion is. And it works pretty well because noise, as I mentioned, is typically something that people draw from a Gaussian distribution, and Gaussian distributions are mathematically very well defined and have a lot of nice properties. So that's why they work so well.
But now the question is how would you adapt this to the text world? Because
in the text world, as you know, we're talking about tokens. Tokens are discrete.
So there's not this concept of adding noise. You cannot have that.
So what people have tried to do was to find the text equivalent that would make sense. And this is what the current research points to, which is that noise is to images what the mask token is to text. So
mask is just a way for us to just not have the information coming from one part of the sequence. And I just want to go through the revised two-step process that we saw here for text. So here
for the forward process, instead of noising your input, you would just have more and more inputs that would be masked.
So that's the forward process, and at the end you would just obtain a sequence full of masked tokens. So what you want to do is to learn some model that allows you to unmask these masked tokens in a way that reconstructs
the original sentence. So there's some math that goes into it.
Obviously, in a few minutes we will not have time to go into that, but I just want to emphasize the key idea, which is: you want to do diffusion, but in a way that makes sense for text inputs. And the way it would make sense is to consider what noise is for images as being the mask token for text inputs.
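To make the inference idea concrete, here is a toy sketch of the reverse process: start from an all-mask answer, and at each step let a model propose tokens for the masked positions and unmask a fraction of them, instead of decoding left to right. The `predict_tokens` function is a hypothetical stand-in for the actual denoising model, and the confidence-based unmasking schedule is just one illustrative choice.

```python
import numpy as np

MASK = "[MASK]"
rng = np.random.default_rng(0)

def predict_tokens(prompt, sequence):
    """Hypothetical denoiser: proposes a token and a confidence for every masked position."""
    vocab = ["the", "answer", "is", "four", "because", "two", "plus", "equals"]
    return {i: (vocab[rng.integers(len(vocab))], rng.random())
            for i, tok in enumerate(sequence) if tok == MASK}

def masked_diffusion_decode(prompt, answer_len=8, n_steps=4):
    sequence = [MASK] * answer_len                    # start from a fully masked answer
    for step in range(n_steps):
        proposals = predict_tokens(prompt, sequence)  # one forward pass per diffusion step
        # Unmask the most confident fraction of positions this step (coarse-to-fine refinement).
        n_to_unmask = int(np.ceil(len(proposals) / (n_steps - step)))
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)[:n_to_unmask]
        for pos, (token, _conf) in best:
            sequence[pos] = token
        print(f"step {step}: {' '.join(sequence)}")
    return sequence

masked_diffusion_decode("What is two plus two?")  # far fewer passes than tokens to generate
```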
Cool. And so with that in mind, you have a bunch of models that are being released these days, which are called masked diffusion models, MDM. So whenever you see MDM now, you know what kind of model we're talking about. The terminology is still not very well settled, so it may change in the future, but one other term you will see out there is also dLLM, diffusion-based LLM. And this is what it's doing: instead of predicting tokens in an autoregressive fashion, like one at a time, what it does at inference time is to go from a completely masked input sequence and try to predict what tokens were behind these masked tokens. So of course, you know, in a real-life setting, you would have some prompt here. So of course, in order to predict the answer, you would have some conditioning. So you would tell your model: okay, given this prompt, you're going to start with all these mask tokens. Just
try to predict what the answer is. So in
case you're having trouble with the intuition: I also had trouble in the beginning, you know, why would it make sense for text to be generated in a diffusion manner, because typically when you write, you write one word at a time. So one helpful way to think about this is: let's suppose you want to write a speech. You would not directly write your speech in a linear way. You would first have like a rough plan. You say, okay, I'm going to talk about this first, then second, then third; you have some kind of draft. And then you try to refine what is in each of these sections. So you can think of diffusion as kind of working like this. It tries to do a coarse-to-fine refinement of the outputs. So it can predict things that are, you know, after a certain token that has not been predicted yet, but you can think of this as something that goes from a very drafty version to a very refined version.
So that's how I think about this process. Hopefully that's helpful. And the key advantage here is that the decoding is now done in much fewer forward passes. Because previously you had to do as many forward passes as there were tokens to predict, but here, for diffusion, you only need to do as many passes as there are steps in your diffusion process, and the number of steps is something you can fix. So the more steps, the higher quality your output is, but the number of steps is typically much lower than the length of your outputs.
So that is the core reason why this kind of model is so much faster than the autoregressive one. And of course we don't have that much time, but in case you're curious, after the final, once you're completely freed in terms of things to think about, I've put some references that could be helpful. There's a paper that came out earlier this year called LLaDA, Large Language Diffusion with masking (I don't remember the full acronym), which actually goes through the math of why the thing I just mentioned works. And then there are a bunch of other papers that would also be helpful; the links are at the bottom of the slide in case you're interested. I just want to go through two last things before giving it to Shervin. First, the advantages of this new paradigm. The first one is speed.
As we mentioned, it's going to be much faster than traditional autoregressive models, especially for longer outputs; some benchmarks even say it's something along the lines of 10x faster. So for cases like coding it can be very powerful, because you may have to do several model calls and, as a user, you're just waiting for that code to appear, so lower latency makes a lot of difference. The other thing is that this approach by nature considers the text as a whole in order to make its predictions. There's a category of coding tasks called fill-in-the-middle, where you have a bunch of code and you want to figure out what's missing in the middle. Diffusion models are typically better formulated for these kinds of tasks because they can take in context from both directions, which is why this approach could be useful for some applications.
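As a small illustration of why fill-in-the-middle falls out naturally here, the sketch below masks only the middle span and keeps the prefix and suffix fixed. The token ids are placeholders, and `MASK_ID` is the same hypothetical mask id as in the earlier sketches.

```python
import torch

MASK_ID = 0  # same hypothetical [MASK] id as in the earlier sketches

# Fill-in-the-middle: keep the prefix and suffix tokens fixed and mask only
# the span you want the model to fill in.
prefix = torch.tensor([11, 12, 13])                  # placeholder ids for the code before the hole
suffix = torch.tensor([21, 22])                      # placeholder ids for the code after the hole
middle = torch.full((16,), MASK_ID, dtype=torch.long)  # the hole to be filled
ids = torch.cat([prefix, middle, suffix])
# The same reverse (unmasking) process is then run on `ids`; only the masked
# positions change, and the model can attend to context on both sides.
```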
So in terms of current work: these models look great, what I mentioned to you looks great, but the performance was not on par with the current frontier models, at least for some time. That is something that may change, though. The papers I mentioned are actually posting performance that is catching up with autoregressive models, so there is some promise there. And then
the other line of work is to adapt all the techniques people have come up with. Like, for instance, reasoning chains: how do you adapt that for diffusion? And so on. There are many techniques that were intrinsically designed around autoregressive models and can be adapted here.
And that's what people are working on. So, long story short, what we saw is that things from this class can be used in other domains: for instance, the vision transformer borrows the transformer for vision-related tasks. But we also saw that things from other domains can be used in the text world, and that's what we saw with diffusion LLMs. This is of course only a subset of everything that's happening. And so with that, I think we're concluding the second item of our menu, and I'm going to just
give it to Shervin. Thank you, Afshin.
And with that, welcome to the last part of the season finale of CME 295. As Afshin mentioned, now is the time for some closing thoughts, and to see what we can take away from this class and the concepts neighboring it. So first, Afshin went through the concept of diffusion in images, and we saw some similarities we could draw with text. Now we're going to see what kinds of inspiration both modalities have taken from each other.
And we're going to see that actually a lot of things can be reused. The first thing I want to mention in terms of what has been reused is the architecture. Afshin mentioned this diffusion concept that was born in the field of images but was taken over to text and was able to yield lower latency, higher speedups, which is great when you're a user. So that was one example of a win. Then, in the other direction, images have traditionally been handled with convolutions as the main architecture type. But these papers found that replacing convolutions with transformers works very well, even yielding better results, so the latest diffusion-based papers in the field of images typically use transformers. Here I'm linking one of the papers that Afshin already briefly went through.
But it's not only the architecture side that is the subject of cross-pollination between these modalities; you could even think of other kinds of components, like the inputs. For this one, I want to mention the example of DeepSeek-OCR. I don't know if you've heard of this paper; it just came out very recently. OCR stands for optical character recognition, which is usually a field that tries to convert some scanned image into text. But actually that paper doesn't boast an improvement on the OCR task itself. Rather, it showed that you could learn a function that reconstructs text tokens from vision tokens, and not just from vision tokens, from very few vision tokens. So it showed that the representational power of image patch tokens is very strong, and some researchers bring a rationale to it, like: hey, tokenizers are not the best tool anyway, and patches already convey the meaning of text, with the example of emojis and so on that you would otherwise need way more text tokens to represent.
Another example I want to mention: even when you look inside the architectures, some of the tricks can be reused and adapted in each field. Here I'm mentioning the example of RoPE, which Afshin mentioned in the recap, that was used in text to represent the relative position of tokens. In the case of images, or even a multimodal setting where you have both text and images within the architecture, you can adapt that trick by reformulating it in 2D. The figure shows how you can attribute RoPE positions on a 2D grid, and how you can place text tokens such that the relative position computation still makes sense.
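As a sketch of that 2D reformulation: one common convention (among several, and not necessarily the exact one from the figure) is to rotate half of the channels with the row index and the other half with the column index.

```python
import torch

def rope_angles(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D RoPE rotation angles for integer positions `pos` (even `dim`)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return pos.float()[:, None] * inv_freq[None, :]          # (num_pos, dim // 2)

def rope_2d_angles(rows: torch.Tensor, cols: torch.Tensor, dim: int) -> torch.Tensor:
    """One common 2D extension: rotate half the channels with the row index and
    the other half with the column index. For a plain text token, one convention
    is to set row == col == its 1D position, so relative positions still work."""
    half = dim // 2
    return torch.cat([rope_angles(rows, half), rope_angles(cols, half)], dim=-1)

# Example: positions for a 2x3 grid of image patches, head dimension 64
rows, cols = torch.meshgrid(torch.arange(2), torch.arange(3), indexing="ij")
angles = rope_2d_angles(rows.flatten(), cols.flatten(), dim=64)   # shape (6, 32)
```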
And even beyond that, I would say that research on transformers is very much alive. People are still figuring out all the details of what we're working with today, and you see refinements coming all the time in new papers, so it's something that is still developing. You can look at it from multiple angles. One is that each of the design decisions we've covered is still being iterated on.
I'm listing a few items here as examples. First, the optimizer side. You might be familiar with the Adam optimizer and its update rule, which has been popular for quite some time. That status seems to be challenged by newer papers; I referenced here the Kimi K2 paper that came out a few months ago, which uses a newer kind of optimizer called Muon, and its latest variant, MuonClip, seems to be a potential candidate to become the new standard. So even a field as basic as this one is still developing.
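For intuition, here is a very rough sketch of the core idea behind Muon as publicly described: keep a momentum of the gradient for each 2D weight matrix, approximately orthogonalize it with a few Newton-Schulz iterations, and use that as the update direction. Real implementations differ in details (Nesterov-style momentum, per-shape scaling, MuonClip's additional QK-clipping), so treat this as illustrative rather than the exact algorithm.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a matrix to the nearest semi-orthogonal matrix
    (roughly U @ V.T of its SVD) without actually computing the SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315          # coefficients of the quintic iteration
    x = g / (g.norm() + 1e-7)                  # normalize so the iteration behaves
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style update for a 2D weight matrix: accumulate momentum of the
    gradient, orthogonalize the momentum, and apply it as the update direction."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    weight.add_(update, alpha=-lr)
    return weight, momentum_buf
```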
But it's not only the optimizer side; you also have the topic of normalization, where Afshin mentioned the difference between the original transformer paper and what you see in LLM papers today: you don't have the same kind of normalization anymore. In the past you had post-norm, but now you have pre-norm, which brings the normalization earlier in the layer. And beyond that design choice of where the normalization sits, even the type of normalization changes. The original transformer paper used layer norm, but these days you might see other normalization techniques like RMSNorm, which uses fewer parameters, and others.
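For reference, here is a minimal sketch of the two normalizations and of the pre-norm versus post-norm placement (shapes and epsilon values are illustrative):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-6):
    """Original transformer normalization: subtract the mean, divide by the std,
    then apply a learned scale and bias (2 * d parameters)."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm, common in recent LLMs: no mean subtraction and no bias,
    so only d learned parameters (the scale)."""
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gamma * x / rms

# Placement within a residual block:
#   post-norm (original transformer): x = norm(x + sublayer(x))
#   pre-norm  (most LLMs today):      x = x + sublayer(norm(x))
```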
The theory behind it is not settled just yet. Then you have other kinds of choices: Afshin mentioned the grouped query attention paper, and these days in LLM papers you don't see a fixed design. Rather, every paper adopts its own technique; sometimes you see one kind of attention used at a given layer and then it switches, and different papers take different design decisions. So it's not set in stone.
Then you also have activation functions. Traditionally in deep learning, a lot of emphasis was put on ReLU, which is very simple and worked very well. But in the world of LLMs, the shift was towards ReLU-like activation functions that are not exactly ReLU, such as the Gaussian Error Linear Unit (GELU) and other kinds, and the research there is still ongoing. You still see new activation functions coming out now and then.
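As a quick sketch of the kinds of activations mentioned here, plus the gated SwiGLU-style unit that many recent LLM feed-forward layers use (the weight shapes below are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def gelu(x):
    """Gaussian Error Linear Unit: x * P(Z <= x) for a standard normal Z,
    a smooth ReLU-like curve used in many transformer models."""
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """A SwiGLU-style gated FFN, another popular choice in recent LLMs:
    one projection gates the other before the down projection."""
    return (F.silu(x @ W_gate) * (x @ W_up)) @ W_down

# Illustrative shapes: model dim 8, FFN hidden dim 16
x = torch.randn(4, 8)
W_gate, W_up, W_down = torch.randn(8, 16), torch.randn(8, 16), torch.randn(16, 8)
out = swiglu_ffn(x, W_gate, W_up, W_down)    # shape (4, 8)
```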
Then you also have the design choice of whether to make the LLM an MoE or not. And even the number of layers of your LLM and other hyperparameters, like the number of heads or the size of the FFN hidden layer, are all still up for debate in terms of design decisions. So it's not fixed.
Then another area of research I want to mention is the data part, which is crucial. The first LLMs enjoyed a relatively clean state of things, because you could scrape the internet and expect a lot of data that was for sure human-generated. So you could learn these patterns from a quote-unquote high-quality source; even though typical internet data is not in a high-quality format, it was still generated by humans. These days the situation has changed: type anything you want into your favorite search engine, and chances are 80% of the first results are LLM-generated. So are we doomed? Maybe not.
Because actually you see the development of more and more work in data curation.
In the past you would just scrape the whole internet and train next-token prediction on it. But now you see more and more work on curating datasets of interest, you have companies that work on it, and you have the emergence of newer training stages. In the past you had pre-training and then fine-tuning; now you have pre-training, mid-training, and fine-tuning, where the mid-training stage still trains on a large corpus of data, but of higher quality. So people are finding ways around it, and I would say the picture is not all grim; it's just that we need to do more work to have meaningful data at hand. The paper linked at the bottom of the slide deals with the question of what happens if you train on LLM-generated data, and it talks about a concept called model collapse: LLM-generated text is typically less diverse, so the data distribution you see at training time changes, which leads to less meaningful learning and is why it's typically bad. It motivates the need for more work on the data side. Okay, nice. Then, even taking a step back on the very architecture we've been using all along: is it the best one? It's not clear. So that itself
is an area of research and future breakthroughs might come from redesigning this architecture.
Okay, so in the past few years, a lot of the research we've seen has been about improving on benchmarks again and again. You have a set of benchmarks and everyone tries to get the best results, which is a natural trend because you want more and more powerful models that fulfill all your use cases. But let's say we reach a point where all the use cases we care about are solved, then what? I think we're going to see the emergence of the other side of the Pareto frontier, where we care most about making LLM predictions cost-effective while still very high quality.
So we see this emergence of smaller and smaller LLMs, which I think has been dubbed small language models, SLMs, in the literature. You sometimes hear LLM providers say that they lose money even on their highest-tier plans, which I think reflects the fact that you need to be smarter about the compute you spend serving LLM queries at test time. I think this will motivate this line of
research in the coming years. Then there is another area that we've not touched on at all in this class, which is the hardware part. Typically, the kind of device you use to train these LLMs is the GPU, which is great at one thing: matrix multiplies. But the thing is, we've kept this kind of hardware to train our models even though the architecture doesn't only need matrix multiplies as its atomic unit of compute. The self-attention and transformer world has all these special needs. The QK-transpose part we saw was actually very expensive, which motivated papers such as FlashAttention to, as Afshin mentioned, forgo storing some intermediate results for the sake of not doing too much data movement on the memory side, even if it meant recomputing the same things afterwards. A lot of work has gone into optimizing how data flows through the GPU memory hierarchy. And this shows that maybe you need a more optimized hardware architecture in
order to solve these use cases.
So there is a recent paper that actually encodes all of these operations as part of the hardware. In the past, the core operation the GPU was great at was matrix multiply, and on top of it you tried to build all of the input-output behavior you wanted. But this paper, which came out I think in September, shows a proof of concept where you get all of these computations as a side effect of implementing the inputs and outputs with analog signals. The computations are embedded in the hardware itself, with pulses as inputs that represent your array values. These hardware architectures have physical properties, like Kirchhoff's current law, where currents simply add up, and they use properties like this so that you feed in what you need as input and just read out the result as output.
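As a toy picture of that idea (not the actual paper's design): encode the weights as conductances and the inputs as voltages, and the row currents you read out are the matrix-vector product.

```python
import torch

# Toy picture of an analog crossbar matrix multiply: weights are stored as
# conductances G and the inputs arrive as voltages v. By Ohm's law each cell
# contributes a current G[i, j] * v[j], and by Kirchhoff's current law the
# currents along a row simply add up, so reading out the row currents gives
# the matrix-vector product "for free", without any digital arithmetic.
G = torch.rand(4, 3)       # placeholder conductances (the weight matrix)
v = torch.rand(3)          # placeholder input voltages (the activation vector)
row_currents = G @ v       # what the hardware would read out as the result
```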
So when the paper simulated this kind of architecture, they observed, without too much surprise, quite a bit of improvement on both the latency and energy-saving fronts, both explained by the fact that you don't need to do the computations yourself: you just get them as a side effect of your hardware.
Now I want to take a step back and look at our use cases of LLMs today and what could lie ahead, what could be the most exciting for you. We've seen that in just a few years it has become quite important to know how to use them if you want to speed up everything you do in your daily life. A few lectures ago we talked about the coding case, where you have all these AI coding assistant tools that let you turn natural language prompts into code. You just ask for something and it helps you do that task with the so-called agent mode. And that's just something you would do as an engineer. Even beyond that, other use cases are deeply impacted, because a lot of problems today can be turned into a text-to-query or text-to-code problem. You could think of the visualization world: for example, there was a recent launch from Google that showed you could generate visualizations on the fly based on some principles. This is already changing quite a bit the picture of what you can do today.
Something else that I think most people use these chatbots for today is as a general assistant. You ask about common facts, and all the facts it has learned at training time become useful. It browses the web more efficiently than you and converts the things you care about into natural language. But there are also other domains impacted by it. You could think of creativity, where a lot of jobs, such as marketing and others, rely on getting something out of a blank page. Usually, if you start from a draft, it's much easier to get something done than thinking it up from scratch, so it's used as an aid there. And there's also one use case I've seen in class that I think is great, that some of you do: sometimes I talk about something, or Afshin talks about something, and I see people typing related concepts into ChatGPT.
I think it's very smart to do that, because brainstorming the concepts you learn is great for actually grasping them, and getting that early feedback loop is very useful for learning. So I have a lot of hope about how great a time it is for you to learn, compared to, say, 10 years ago.
So yeah, keep doing that and I think it will be a growing use case.
So, looking forward. I said "tomorrow" on the slide, but actually two days ago there was a launch that went in the direction of what I was thinking about, which is that all the agentic things we talked about are still very much confined to people who know about the field. Everyday people wouldn't typically use quote-unquote agentic workflows. So I think one direction of development we will see more and more is all these use cases being democratized, so people can create things that are useful to them through easier mediums: just natural language, no need to code.
Moving a bit further out, all this AI assistance I was mentioning, you could also think of it as helping you browse the internet in a very natural fashion. Right now, when you execute tasks yourself, you're way too microscopic in what you do, and this is typically where an AI assistant could help. So there
are recent product launches that reflect this growing interest, such as ChatGPT Atlas. I think it launched in October; I don't know if they've released public usage numbers, but I suspect adoption is still timid because there are still challenges when it comes to security: anyone could inject a bad prompt in there and maybe exfiltrate things from you. But I'm sure the community will come up with ways to get around it. In the past you had HTTPS, for example, to say that a connection was secure. Maybe
tomorrow you'll have some certificate that guarantees a website is safe for AI-assisted browsing. And looking at an even higher level, maybe navigating your desktop or your mobile phone, at the OS level, might be something the LLM can help with. When we talked about agents, we mentioned how unreliable they could be, because as you chain more and more steps the probability of failure increases, so stabilizing predictions is something of interest. And
even in the longer run, I think one test we can keep in mind for how far we've come is whether the common use case of customer service handled by AI is truly useful. I don't know about you, but every time I have a problem, I'm on the phone, and I hear some AI assistant, maybe an LLM-powered robot, it's "quick, quick, I want a human, I don't want that". I think it shows how hard this space of problems is, because a human brings many more dimensions of value than an LLM does: empathy, groundedness, things that you and I perceive as making sense even though they're not in our system prompts. So I think there are hard problems to solve there. And even
moving forward, there are some key challenges with the current architecture. We saw during the class that you go through a training process that fixes some weights, and these weights don't change afterwards; we use tricks such as RAG or tools to get around this issue. But could we think of a system that learns continuously? I think that is an open question. Then there is the topic of "hallucinations", which I put in quotation marks because I'm not sure it's fair to say that the LLM hallucinates: we've trained the LLM to predict the next token, not to map statements to facts, so hallucinating is in some sense a core design choice of these LLMs. Then there's personalization, interpretability, safety; the list goes on.
Now I want to briefly cover how you can exercise this muscle of staying up to date from now on. You have arXiv, which usually contains all the latest and greatest papers you can take a look at. Of course, venues like NeurIPS, happening right now, are great for highlighting papers. And besides the papers, I highly encourage you to look at the associated codebases that the authors provide; right now it's commonplace to publish the implementation of what you're proposing, and I think it's very insightful for learning the concepts. There was also Papers with Code in the past, which has been replaced by Hugging Face's trending papers page, which I think is a good place to take a look at the latest methods. And then on the
social network side, Twitter/X has a lot of the latest work being discussed, so you have a strong community there, and if you have an account on that platform there are a lot of great people to follow to stay updated. You also have resources on YouTube, with a highlight on Yannic Kilcher, who I think was the first YouTuber to cover the transformer paper back in 2017 in great detail. Some of these YouTubers are very good at talking through papers in great detail.
Another highlight is Andrej Karpathy, who was at Stanford about 10 years ago and is one of the best educators out there, so I highly recommend his videos. Company blogs are also great. And there's the study guide we associated with the class: we've had it for this year, and what we'll try to do in the coming years is keep it updated at least on a yearly basis, so you can consider this resource as a companion. We've also had the chance to collaborate with experts around the world to make it available in other languages, in case you're interested. Taking a step back, I just want to say that Afshin and I were very grateful to teach this class this quarter. Thank you so much for coming here on
Friday evening, which is quite telling, because Friday evening is usually time for fun, not for lectures. Also, I think this is probably your last lecture of the whole quarter, because next week is finals. So yeah, thank you for coming and for asking all these great questions. You were one of the reasons why this class was so great and interactive. Also, thank you to the folks who are watching online from home. I was one of you eight years ago; I would almost never go to class, always watching lectures from home from a cozy place. I hope the lectures were entertaining and that you got something out of them.
And I couldn't conclude without bringing our favorite teddy bear one last time, thanking you all for your attention and wishing you all the best.
Thank you.