Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 1: Overview and Tokenization
By Stanford Online
Summary
## Key takeaways

- **Build to Understand**: The course philosophy emphasizes that true understanding of language modeling comes from building these systems from scratch, rather than relying solely on abstractions like prompting proprietary models. [49:22], [03:56:00]
- **Abstraction Layers Are Leaky**: While layers of abstraction can unlock research, they are not always transparent. Understanding the fundamental technology requires 'tearing up the stack' and co-designing data, systems, and models. [03:18:00], [03:36:00]
- **Small vs. Large Scale Models Differ**: Optimizing for small-scale models can be misleading; architectural components like attention vs. MLP layers have different computational proportions at larger scales, and emergent behaviors like in-context learning only appear at scale. [05:14:00], [06:18:00]
- **The Bitter Lesson: Scale + Algorithm Efficiency**: The 'bitter lesson' isn't just about scale, but about the crucial interplay of algorithms and scale. Algorithmic efficiency is paramount, especially at large scales where waste is costly, and has historically outpaced hardware improvements. [09:33:00], [10:10:00]
- **BPE Tokenization: Adaptive & Efficient**: Byte Pair Encoding (BPE) tokenization adaptively creates tokens based on corpus statistics, representing common character sequences as single tokens and rare ones with multiple, striking a balance between vocabulary size and sequence length. [01:11:05], [01:12:22]
- **Data Curation is Critical, Not Passive**: Data for training language models is not passively acquired from the internet; it requires active acquisition, processing, filtering, and curation to ensure quality and remove harmful content, as much of the web is 'trash'. [45:04:00], [47:05:00]
Topics Covered
- Why Deep Understanding Requires Building AI From Scratch
- Small Models Fail to Predict Frontier LLM Behavior
- Algorithms at Scale: Reinterpreting the Bitter Lesson
- Data Curation: The Unseen Art of Building LLM Datasets
- The Future of LLMs: Data Constraints Reshape Research Priorities
Full Transcript
Welcome everyone. This is CS336, Language Models from Scratch, and this is our core staff. I'm Percy, one of your instructors. I'm really excited about this class because it allows you to see the whole language model building pipeline end to end, including data, systems, and modeling. Tatsu here, I'll be co-teaching with him. So, I'll let everyone introduce themselves.
Hi everyone. I'm Tatsu. I'm one of the co-instructors. I'll be giving lectures in a week or two, probably a few weeks. I'm really excited about this class. Percy and I spent a while being a little disgruntled, thinking about what's the really deep technical stuff we can teach our students today. And I think one of the answers is that you've got to build it from scratch to understand it. So I'm hoping that's the ethos that people take away from this class.
Uh, hey everyone. I'm Rohith. I actually failed this class when I took it, but now I'm your CA. So, when they say anything is possible...
Hey everyone, I'm Neil. I'm a third-year PhD student in the CS department. I work with Tatsu. I'm mostly interested in research on synthetic data, language models, reasoning, all that stuff. So yeah, should be a fun quarter.
Uh, hey guys, I'm Marcel. I'm a second-year PhD student. These days I work on health.
And he topped many of the leaderboards from last year, so he's the number to beat. Okay. All right. Well, thanks everyone. So, let's continue.
As Tatsu mentioned, this is the second time we're teaching the class. We've grown the class by around 50%; we have three TAs instead of two. And one big thing is we're putting all the lectures on YouTube so that the world can learn how to build language models from scratch. Okay. So why did we decide to make this course and endure all the pain? Well, let's ask GPT-4. So if
you ask it why teach a course on
building language models from scratch.
the reply is: teaching a course provides a foundational understanding of techniques, fosters innovation, and so on; kind of the typical generic blather. Okay, so here's the real reason. We're in a bit of a crisis, I would say: researchers are becoming more and more disconnected from the underlying technology. Eight years ago,
researchers in AI would implement and train their own models. Even six years ago, you would at least take models like BERT, download them, and fine-tune them. And now many people can just get away with prompting a proprietary model. This is not necessarily bad, right? Because as you introduce layers of abstraction, we can all do more, and a lot of research has been unlocked by the simplicity of being able to prompt a language model. I do my fair share of prompting, so there's nothing wrong with that. But it's also important to remember that
these abstractions are leaky. So in
contrast to programming languages or
operating systems, you don't really understand what the abstraction is; it's a string in and a string out, I guess. And I would say there's still a lot of fundamental research to be done that requires tearing up the stack and co-designing different aspects of the data, the systems, and the model, and I think that full understanding of this technology is necessary for fundamental research. So that's why this class exists. We want to enable fundamental research to continue, and our philosophy is: to understand it, you have to build it.
So there's one small problem here, and this is the industrialization of language models. GPT-4 is rumored to be 1.8 trillion parameters and cost 100 million dollars to train. You have xAI building clusters with 200,000 H100s, if you can imagine that. There's an investment of over 500 billion dollars, supposedly, over four years. So these are pretty large numbers, right? And furthermore, there are no public details on how these models are being built. Here, from the GPT-4 report (and this is even two years ago), they very honestly say that due to the competitive landscape and the safety implications, they're going to disclose no details. Okay, so this is the state of the world right now. And so in some sense, frontier models are out of reach for us. So if you came into this class thinking you're each going to train your own GPT-4: sorry. We're going to build small language models, but the problem is that these might not be representative. And
here are two examples to illustrate why. Here's a simple one. If you look at the fraction of FLOPs spent in the attention layers of a transformer versus the MLP, this changes quite a bit with scale. This is a tweet from Stephen Roller from quite a few years ago, but it's still true. If you look at small models, the number of FLOPs in the attention versus the MLP layers is roughly comparable. But if you go up to 175 billion parameters, then the MLPs really dominate. So why does this matter? Well, if you spend a lot of time at small scale optimizing the attention, you might be optimizing the wrong thing, because at larger scale it just gets washed out. This is kind of a simple example because you can literally make this plot without any compute; it's just napkin math.
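To make the napkin math concrete, here's a rough sketch of that accounting (my own simplified version, assuming 2 FLOPs per multiply-accumulate and a 4x MLP expansion; exact conventions vary, so treat the numbers as directional):

```python
def attention_vs_mlp_flops(d_model, n_ctx):
    # Attention: Q, K, V, and output projections (four d x d matmuls),
    # plus attention scores and the weighted sum over n_ctx positions.
    attn = 2 * 4 * d_model**2 + 2 * 2 * n_ctx * d_model
    # MLP: two linear layers, d -> 4d and 4d -> d.
    mlp = 2 * 2 * 4 * d_model**2
    return attn, mlp

for name, d, ctx in [("GPT-2 small", 768, 1024), ("GPT-3 175B", 12288, 2048)]:
    attn, mlp = attention_vs_mlp_flops(d, ctx)
    print(f"{name}: attention share ~ {attn / (attn + mlp):.2f}")
# Prints roughly 0.45 for the small model and 0.35 at 175B scale: the d^2 terms
# grow faster than the context-length terms, so the MLP share keeps increasing.
```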
Here's something that's a little bit harder to grapple with: emergent behavior.
This is a paper from Jason Wei from 2022, and this plot shows that as you increase the amount of training FLOPs and look at accuracy on a bunch of tasks, for a while it looks like nothing is happening, and then all of a sudden you get the emergence of various phenomena like in-context learning. So if you were hanging around at small scale, you might have concluded that these language models really don't work, when in fact you had to scale up to get that behavior. So don't despair: we can
still learn something in this class, but we have to be very precise about what we're learning. There are three types of knowledge. First, there's the mechanics of how things work. This we can teach you: we can teach you what a transformer is, and you'll implement one. We can teach you how model parallelism leverages GPUs efficiently. These are the raw ingredients, the mechanics. So
that's fine. We can also teach you
mindset. This is something a bit more subtle and seems a little fuzzy, but it's actually in some ways more important, I would say. The mindset we're going to take is that we want to squeeze as much out of the hardware as possible and take scaling seriously, because in some sense, as we'll see later, all of these mechanical ingredients have been around for a while, but it was really the scaling mindset that OpenAI pioneered that led to this next generation of AI models. So mindset, hopefully, we can bang into you, so that you think in a certain way. And then thirdly there are intuitions, which are about which data and modeling decisions lead to good models. This, unfortunately, we can only partially teach you, because the architectures and data sets that work at small scales might not be the same ones that work at large scales. But hopefully you get two and a half out of three, so that's pretty good bang for your buck. Okay, speaking of
intuitions, there's this sort of sad reality: you can tell a lot of stories about why certain things in the transformer are the way they are, but sometimes you just do the experiments and the experiments speak. For example, there's the Noam Shazeer paper that introduced SwiGLU, a type of nonlinearity which we'll see a bit more of in this class. The results are quite good, and this got adopted. But in the conclusion, there's this honest statement that they offer no explanation and attribute it to divine benevolence. So there you go. That is the extent of our understanding. Okay. So now let's talk
about this bitter lesson that I'm sure people have heard about. I think there's a misconception that the bitter lesson means scale is all that matters, algorithms don't matter, and all you do is pump more capital into building the model and you're good to go. I think this couldn't be further from the truth. The right interpretation is that algorithms at scale are what matters, because at the end of the day, the accuracy of your model is really a product of your efficiency and the amount of resources you put in. And efficiency, if you think about it, is way more important at larger scale, because if you're spending hundreds of millions of dollars, you cannot afford to be wasteful in the way you can when running a job on your local cluster, where you might run it, fail, debug it, and run it again. And if you look at actual utilization, I'm sure OpenAI is way more efficient than any of us right now.
So efficiency really is important and
furthermore, this point is maybe not as well appreciated in the scaling rhetoric, so to speak. Overall efficiency is a combination of hardware and algorithms, but if you just look at algorithmic efficiency, there's a nice OpenAI paper from 2020 that showed that over the period of 2012 to 2019 there was a 44x algorithmic efficiency improvement in the compute needed to train ImageNet to a certain level of accuracy. So this is huge, and, I don't know if you can see the abstract here, this is faster than Moore's law. So algorithms do matter: if you didn't have this efficiency, you would be paying 44 times more. This is for image models, but there are some results for language as well. Okay. So with all that,
I think the right framing or mindset to
have is what is the best model one can
build given a certain compute and data
budget. Okay. And this question makes
sense no matter what scale you're at
because you're essentially optimizing accuracy per resource. And of course, if
you can raise the capital and get more
resources you'll get better models. But
as researchers, our goal is to improve
the efficiency of the
algorithms. Okay. So maximize
efficiency. We're going to hear a lot of
that. Okay. So now let me talk a little bit about the current landscape, and a little bit of, I guess, obligatory history. Language models have been around for a while now, going back to Shannon, who looked at language models as a way to estimate the entropy of English. In AI, they really became prominent in NLP, where they were a component of larger systems like machine translation and speech recognition. One thing that's maybe not as appreciated these days is that back in 2007, Google was training fairly large n-gram models: 5-gram models over two trillion tokens, which is a lot more tokens than GPT-3. It was only in the last two years or so that we've gotten back to that token count. But they were n-gram models, so they didn't really exhibit any of the interesting phenomena that we know from language models today. Okay. So in
the 2010s, a lot of the deep learning revolution happened and a lot of the ingredients fell into place. There was the first neural language model from Yoshua Bengio's group back in 2003. There were sequence-to-sequence models, which I think were a big deal for how you model sequences, from Ilya Sutskever and the Google folks. There's the Adam optimizer, dating back over a decade, which is still used by the majority of people. There's the attention mechanism, which was developed in the context of machine translation and then led up to the famous Attention Is All You Need, aka the Transformer, paper in 2017. People were looking at how to scale mixture of experts. There was a lot of work in the late 2010s on how to do model parallelism, where they were actually figuring out how you could train 100-billion-parameter models. They didn't train them for very long, because these were more systems papers, but all the ingredients were in place by the time 2020 came around.
So I think one other trend, which started in NLP, was the idea of these foundation models that could be trained on a lot of text and adapted to a wide range of downstream tasks. So ELMo, BERT, T5: these were models that were, for their time, very exciting. We maybe forget how excited people were about things like BERT, but it was a big deal. And then, this is abbreviated history, but I think one critical piece of the puzzle was OpenAI taking these ingredients, applying very nice engineering, and really pushing on the scaling laws, embracing them (this is the mindset piece), and that led to GPT-2 and GPT-3. Google, obviously, was in the game and trying to compete as well. That sort of paved the way, I think, for another line of work. These were all closed models, models that weren't released and that you could only access via an API. But there were also open models, starting with early work by EleutherAI right after GPT-3 came out, Meta's early attempt (which maybe didn't work quite as well), BLOOM, and then Meta, Alibaba, DeepSeek, AI2, and a few others I have listed have been creating these open models where the weights are released. One
other tidbit about openness that I think is important is that there are many levels of openness. There are closed models like GPT-4. There are open-weight models, where the weights are available and there's often a very nice paper with lots of architectural details but no details about the data set. And then there are open-source models, where the weights and the data are available and the paper honestly tries to explain as much as it can. But of course you can't really capture everything in a paper, and there's no substitute for learning how to build it except doing it yourself. Okay. So that leads to kind of
the present day, where there's a whole host of frontier models from OpenAI, Anthropic, xAI, Google, Meta, DeepSeek, Alibaba, Tencent, and probably a few others that dominate the current landscape. So we're at an interesting time where, just to reflect, a lot of the ingredients, like I said, were already developed, which is good because we're going to revisit some of those ingredients and trace how these techniques work. And then we're going to try to move as close as we can to best practices for frontier models, using information from the open community and reading between the lines of what we know about the closed models. Okay. So just as an interlude,
so what are you looking at here? This is an executable lecture. It's a program where I'm stepping through it and it delivers the content of the lecture. One thing that I think is interesting here is that you can embed code, so you can just step through code (and I think this is a smaller screen than I'm used to), and you can look at the environment variables as you're stepping through. That's useful later when we start actually trying to drill down and give code examples. You can see the hierarchical structure of the lecture, like which module we're in and where it was called from in main, and you can jump to definitions, like supervised fine-tuning, which we'll talk about later. Okay. And if you think this looks like a Python program, well, it is a Python program, but I've post-processed it for your viewing pleasure.
Okay. So, let's move on to the course logistics now. Actually, maybe I'll pause for questions. Any questions about what we're learning in this class?
Yeah.
Would you expect a graduate of this class to be able to lead a team to build a frontier model?
So the question is, would I expect a graduate of this class to be able to lead a team and build a frontier model (with, of course, like a billion dollars of capital)? I would say that it's a good step, but there are definitely many pieces that are missing. We thought about whether we should really teach a series of classes that eventually leads up to that, as close as we can get. But I think this is maybe the first step of the puzzle; there are a lot of other things, and I'm happy to talk offline about that. But I like the ambition. Yeah, that's what you should be doing: taking the class so you can go lead teams and build frontier models.
Okay.
Um, okay, let's talk a little bit about the course. So here's the website; everything's online. This is a five-unit class, but I think that maybe doesn't express the level here as well as this quote that I pulled out from a course evaluation: "the entire assignment was approximately the same amount of work as all five assignments from CS224N plus the final project." And that's the first homework assignment. So, not to scare you all off, but just giving you some data here. So why should you endure that?
Why should you do it? I think this class is really for people who have this obsessive need to understand how things work all the way down to the atoms, so to speak. And I think when you get through this class, you will have really leveled up in terms of your research engineering, and the level of comfort you'll have building ML systems at scale will be something. There are also a bunch of reasons you shouldn't take the class. For example, if you want to get any research done this quarter, maybe this class isn't for you. If you're interested in learning just about the hottest new techniques, there are many other classes that can probably deliver on that better than you spending a lot of time debugging BPE. This is really a class about the primitives and learning things bottom-up, as opposed to the latest and greatest. And also, if you're interested in building language models for some application X, this is probably not the first class you would take. I think,
practically speaking, as much as I made fun of prompting, prompting is great, and fine-tuning is great. If you can do that and it works, then that is absolutely what you should start with. I don't want people taking this class and thinking that for any problem, the first step is to train a language model from scratch. That is not the right way of thinking about it. Okay. And I know that many of you wanted to enroll, but we did have a cap, so we weren't able to enroll everyone. For the people following along at home and online: all the lecture materials and assignments are online so you can look at them, and the lectures are also recorded and will be put on YouTube, although with some number of weeks of lag. We'll also offer this class next year, so if you weren't able to take it this year, don't fret, there will be a next time. Okay. So, the class has five
assignments. For each of the assignments, we don't provide scaffolding code, in the sense that we literally give you a blank file and you're supposed to build things up, in the spirit of learning by building from scratch. But we're not that mean: we do provide unit tests and some adapter interfaces that allow you to check the correctness of different pieces, and the assignment write-up, if you walk through it, does a gentle job of guiding you. But you're on your own for making good software design decisions, figuring out what to name your functions, and how to organize your code, which is a useful skill, I think.
So one strategy for all assignments: there is a piece of each assignment which is just "implement the thing and make sure it's correct." That you can mostly do locally on your laptop; you shouldn't need compute for that. And then we have a cluster that you can use for benchmarking both accuracy and speed. I want everyone to embrace this idea of using as small a data set and as few resources as possible to prototype before running large jobs. You shouldn't be debugging with one-billion-parameter models on the cluster if you can help it. Some assignments will have a leaderboard, which is usually of the form "do things to make perplexity go down given a particular training budget." Last year it was, I think, pretty exciting for people to try different things that they either learned from the class or read about online.
And then finally, I guess this was less of a problem last year because Copilot wasn't as good, but, you know, Cursor is pretty good now. Our general stance is that AI tools can take away from learning, because there are cases where they can just solve the thing you want them to do. But you can obviously use them judiciously. So use them at your own risk; you're responsible for your own learning experience here. Okay. So, we do have a cluster. Thank you to Together AI for providing a bunch of H100s for us. There's a guide; please read it carefully to learn how to use the cluster. And start your assignments early, because the cluster will fill up towards the end of a deadline as everyone tries to get their large runs in.
Okay. Any questions about that?
You mentioned it was five units. Are you able to sign up for it for fewer?
Right. So, the question is, can you sign up for less than five units? I think administratively, if you have to sign up for less, that is possible, but it's the same class and the same workload.
Yeah. Any other questions?
Okay. So in this part I'm going to go through all the different components of the course and give a broad overview, a preview of what you're going to experience. Remember, it's all about efficiency given hardware and data: how do you train the best model given your resources? So for example, if I give you a Common Crawl web dump and 32 H100s for two weeks, what should you do? There are a lot of different design decisions: questions about the tokenizer, the architecture, systems optimizations you can do, data things you can do, and we've organized the class into five units or pillars. I'm going to go through each of them in turn, talk about what we'll cover and what the assignment will involve, and then I'll wrap up. Okay. So the goal of the basics unit is to get a basic version of the full pipeline working. Here you implement a tokenizer, a model architecture, and training. To say a bit more about what these components are: a tokenizer is something that converts between strings and sequences of integers. Intuitively, you can think of the integers as corresponding to breaking up the string into segments and mapping each segment to an integer. And the idea is that your sequence of integers is what goes into the actual model, which has to have a fixed vocabulary. Okay. So in this course we'll
talk about the byte pair encoding (BPE) tokenizer, which is relatively simple and still widely used. There is, I guess, a promising set of tokenizer-free approaches: methods that just start with the raw bytes, don't do tokenization, and develop a particular architecture that takes the raw bytes directly. This work is promising, but so far I haven't seen it scaled to the frontier yet. So we'll go with BPE for now. Okay, so once you've tokenized your
strings into a sequence of integers, now
we define a model architecture over
these sequences. The starting point here is the original Transformer; that's the backbone of basically all frontier models. Here's the architectural diagram. We won't go into details here, but there's an attention piece and then an MLP layer with some normalization. A lot has actually happened since 2017. There's a sense in which, oh, the Transformer was invented and everyone's just using the Transformer, and to a first approximation that's true, we're still using the same recipe. But there have been a bunch of smaller improvements that do make a substantial difference when you add them all up. For example, there's the nonlinear activation function, the SwiGLU, which we saw a little bit before. For positional embeddings, there are these rotary positional embeddings, which we'll talk about. For normalization, instead of using LayerNorm we're going to look at something called RMSNorm, which is similar but simpler; there's also the question of where you place the normalization, which has changed from the original Transformer. For the MLP, the canonical version is a dense MLP, and you can replace that with a mixture of experts. Attention is something that has actually been getting a lot of, well, attention. There's full attention, and then there's sliding window attention and linear attention; all of these are trying to prevent the quadratic blow-up. There are also lower-dimensional versions like GQA and MLA, which we'll get to, not in a second, but in a future lecture. And then maybe the most radical thing is alternatives to the Transformer, like state space models such as Hyena, which don't do attention but some other sort of operation; sometimes you get the best of both worlds by making a hybrid model that mixes these in with Transformers.
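To give a flavor of how simple some of these pieces are, here's a minimal sketch of RMSNorm (my own illustrative implementation, not the assignment's required interface):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale activations by their root-mean-square (no mean subtraction,
    no bias), with a learned per-dimension gain."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gain
```

Compared to LayerNorm, there's no mean subtraction and no bias term, which saves a bit of compute and in practice tends to work just as well.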
Okay, so once you define your architecture, you need to train it. The design decisions include the optimizer: AdamW, which is basically a fixed-up variant of Adam, is still very prominent, so we'll mostly work with that, but it's worth mentioning that there are more recent optimizers like Muon and SOAP that have shown promise. Then there's the learning rate schedule, the batch size, whether you do regularization or not, hyperparameters. There are a lot of details here, and I think this class is one where the details do matter, because you can easily have an order-of-magnitude difference between a well-tuned architecture and something that's just a vanilla transformer.
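To make the "AdamW is Adam with decoupled weight decay" point concrete, here's a sketch of a single update step for one parameter tensor (a simplified, hypothetical helper; real optimizers track this state per parameter and fuse the operations):

```python
import torch

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Decoupled weight decay: shrink the weights directly, not via the gradient.
    param.mul_(1 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-corrected estimates, then the Adam-style update.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)

# Tiny usage example with dummy tensors.
p, g = torch.zeros(3), torch.ones(3)
m, v = torch.zeros(3), torch.zeros(3)
adamw_step(p, g, m, v, t=1)
```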
So in assignment one, you'll basically implement the BPE tokenizer. I'll warn you that this is actually the part that seems to have been, surprisingly, a lot of work for people, so you've been warned. You'll also implement the Transformer, the cross-entropy loss, the AdamW optimizer, and the training loop. So again, the whole stack. We're not making you implement PyTorch from scratch, so you can use PyTorch, but you can't use, say, PyTorch's Transformer implementation; there's a small list of functions that you can use, and you can only use those. We're going to have the TinyStories and OpenWebText data sets for you to train on, and then there will be a leaderboard to minimize OpenWebText perplexity. We'll give you 90 minutes on an H100 and see what you can do. This is last year's leaderboard; you can see the top entry. That's the number to beat for this year.
Okay. All right. So that's the basics. Now, after basics, in some sense you're done, right? You have the ability to train a transformer; what else do you need? Well, the systems unit really goes into how you can optimize this further: how do you get the most out of the hardware? And for this we need to take a closer look at the hardware and how we can leverage it. Kernels, parallelism, and inference are the three components of this unit. Okay, so to first talk about kernels, let's talk a little bit about what a GPU looks
like. So a GPU, which we'll get into much more, is basically a huge array of little units that do floating-point operations. Maybe the one thing to note is that this is the GPU chip, and here is the memory, which is actually off-chip; then there's some other memory like L2 caches and L1 caches on chip. And so the basic idea is that compute has to happen here, your data might be somewhere else, and you have to organize your compute so that you can be most efficient. One quick analogy: imagine that your memory, where you store your data and the model parameters, is like a warehouse, and your compute is like the factory. What ends up being a big bottleneck is just data movement costs. So the thing we have to figure out is how to organize the compute, even for something like a matrix multiplication, to maximize the utilization of the GPUs by minimizing data movement, and there are a bunch of techniques like fusion and tiling that allow you to do that. We'll get into all the details of that. To implement kernels, we're going to look at Triton. There are other things you can do with various levels of sophistication, but we're going to use Triton, which was developed by OpenAI and is a popular way to build kernels. Okay, so we're going to write some kernels.
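To give a flavor of what Triton code looks like, here's roughly the vector-addition kernel from Triton's introductory tutorial (a toy that needs a GPU to run; the kernels in the assignment fuse more interesting work per memory access):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```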
That's for one GPU. Now, in general, these big runs take thousands if not tens of thousands of GPUs, but even at 8 GPUs it starts becoming interesting, because you have a lot of GPUs, they're connected to some CPU nodes, and they're also directly connected via NVSwitch and NVLink. It's the same idea as before; the only thing is that data movement between GPUs is even slower. So we need to figure out how to take the model parameters, activations, and gradients, put them on the GPUs, and do the computation while minimizing the amount of data movement.
And then we're going to explore different types of techniques like data parallelism, tensor parallelism, and so on.
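As a preview of the simplest of these, here's a sketch of naive data parallelism: each rank computes gradients on its own slice of the batch and then averages them with an all-reduce (the batch keys are made up for illustration; it assumes torch.distributed has already been initialized, and real implementations overlap communication with the backward pass):

```python
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, optimizer):
    """Every rank holds a full copy of the model, computes gradients on its own
    shard of the batch, then averages gradients so the replicas stay in sync."""
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
    optimizer.step()
    optimizer.zero_grad()
```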
So that's all I'll say about that. And finally, inference is something that
we didn't actually cover last year in the class, although we had a guest lecture. But this is important, because inference is how you actually use a model: it's basically the task of generating tokens from a trained model given a prompt. It also turns out to be really useful for a bunch of other things besides just chatting with your favorite model: you need it for reinforcement learning, for test-time compute (which has been very popular lately), and even for evaluating models you need to do inference. So we're going to spend some time talking about inference. Actually, if you think about it globally, the cost spent on inference is eclipsing the cost used to train models, because training, despite being very intensive, is ultimately a one-time cost, while inference cost scales with every use. The more people use your model, the more you'll need inference to be efficient.
Okay. So in inference there are two phases: prefill and decode. Prefill is where you take the prompt and run it through the model to get some activations. And then decode is where you go autoregressively, one by one, and generate tokens. In prefill, all the tokens are given, so you can process everything at once. This is exactly what you see at training time, and generally this is a good setting to be in because it's naturally parallel and you're mostly compute-bound. What makes inference special and difficult is the autoregressive decoding: you need to generate one token at a time, it's hard to actually saturate all your GPUs, and it becomes memory-bound because you're constantly moving data around. We'll talk about a few ways to speed inference up. You can use a cheaper model. You can use this really cool technique called speculative decoding, where you use a cheaper model to scout ahead and generate multiple tokens, and then, if these tokens happen to be good by some definition, you can have the full model score and accept them all in parallel. And then there are a bunch of systems optimizations that you can do as well.
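Here's a greatly simplified, greedy sketch of the speculative decoding idea (the draft_lm and target_lm callables are hypothetical next-token predictors; the real algorithm uses a rejection-sampling scheme so the output distribution exactly matches the target model, and the target's scores all come out of one batched forward pass):

```python
def speculative_decode_step(target_lm, draft_lm, tokens, k=4):
    # 1. The cheap draft model scouts ahead k tokens, one at a time.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_lm(draft))
    proposed = draft[len(tokens):]

    # 2. The target model scores every drafted position (in a real system this
    #    is a single parallel forward pass; here we call a per-prefix helper).
    target_preds = [target_lm(draft[: len(tokens) + i]) for i in range(k)]

    # 3. Accept the longest matching prefix, then take one token from the target.
    accepted = []
    for prop, tgt in zip(proposed, target_preds):
        if prop == tgt:
            accepted.append(prop)
        else:
            accepted.append(tgt)
            break
    else:
        accepted.append(target_lm(draft))  # all k accepted: take a bonus token
    return tokens + accepted
```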
Okay. So, assignment two: you're going to implement a kernel and some parallelism. Data parallelism is very natural, so we'll do that. Some of the model parallelism, like FSDP, turns out to be a bit complicated to do from scratch, so we'll do a baby version of that. I encourage you to learn about the full version, and we'll go over the full version in class, but implementing it from scratch might be a bit too much. And then I think an important thing is getting in the habit of always benchmarking and profiling. That's actually probably the most important thing: you can implement things, but unless you have feedback on how well your implementation is doing and where the bottlenecks are, you're just going to be flying
blind. Okay, so unit three is scaling laws. Here the goal is to do experiments at small scale, figure things out, and then predict the hyperparameters and loss at large scale. So here's a fundamental question: if I give you a FLOPs budget, what model size should you use? If you use a larger model, that means you can train on less data, and if you use a smaller model, you can train on more data. So what's the right balance? This has been studied quite extensively and figured out by a series of papers from OpenAI and DeepMind; if you hear the term "Chinchilla optimal," this is what it's referring to. The basic idea is that for every compute budget (number of FLOPs), you can vary the number of parameters of your model and measure how good the resulting model is. So for every level of compute you can get the optimal parameter count, and then you can fit a curve to extrapolate and see, if you had, let's say, 1e22 FLOPs, what the parameter count should be.
And it turns out that when you plot these minima, the relationship is remarkably linear, which leads to a very simple but useful rule of thumb: if you have a model of size N, multiply by 20 and that's roughly the number of tokens you should train on. So that means a 1.4 billion parameter model should be trained on about 28 billion tokens.
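As a napkin-math sketch, you can combine the D ≈ 20N rule of thumb with the common approximation that training compute is about 6ND FLOPs (N parameters, D tokens):

```python
def chinchilla_optimal(compute_flops: float):
    # C = 6 * N * D with D = 20 * N gives C = 120 * N^2, so N = sqrt(C / 120).
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e22)
print(f"~{n/1e9:.1f}B parameters trained on ~{d/1e9:.0f}B tokens")
# -> roughly 9B parameters on ~180B tokens for a 1e22 FLOPs budget
```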
Okay, but this doesn't take inference cost into account; this is literally "how can you train the best model regardless of how big that model is." So there are some limitations here, but it's nonetheless been extremely useful for model development. This assignment is kind of fun because we define a quote-unquote training API, which you can query with a particular set of hyperparameters: you specify the architecture, batch size, and so on, and we return the loss that your decisions would get you. Your job is, given a FLOPs budget, to figure out how to train a bunch of models and gather the data, fit a scaling law to the gathered data, and then submit your prediction of what you would choose for the hyperparameters, model size, and so on at a larger scale. This is a case where we want to put you in a position where there are some stakes. This is not burning real compute, but once you run out of your FLOPs budget, that's it. So you have to be very careful about how you prioritize which experiments to run, which is something the frontier labs have to do all the time. And there will be a leaderboard for this, which is to minimize loss given your FLOPs budget.
Question: these links point to 2024. If we're working ahead, should we expect assignments to change over time, or are these going to be the final assignments?
So the question is that these links are from 2024. The rough structure will be the same for 2025. There will be some modifications, but if you look at these, you should have a pretty good idea of what to
expect. Okay, so let's go into data now.
Up until now you have scaling laws, you have systems, you have your transformer implementation; you're really kind of good to go. But data, I would say, is a really key ingredient that differentiates models in some sense. The question to ask here is: what do I want this model to do? Because what the model does is mostly determined by the data. If I train on multilingual data, it will have multilingual capabilities. If I train on code, it'll have code capabilities. It's very natural. And usually data sets are a conglomeration of a lot of different pieces. This is from the Pile, which is four years old, but the same idea holds: you have data from the web (this is Common Crawl), and you have maybe Stack Exchange, Wikipedia, GitHub, and different sources which are curated. So in the data unit we're going to start by talking about evaluation: given a model, how do you evaluate whether it's any good? We're going to talk about perplexity-based measures, standardized testing like MMLU, and, if you have models that generate utterances for instruction following, how you evaluate that. There are also decisions about whether you ensemble or do chain of thought at test time and how that affects your evaluation. And then you can talk about evaluation of entire systems, not just a language model, because language models these days often get plugged into some agentic system or something. Okay, so now after
establishing evaluation, let's look at data curation. This is an important point that people don't realize. I often hear people say, oh, we're training the model on the internet. This just doesn't make sense: data doesn't just fall from the sky, and there isn't some single "internet" that you can pipe into your model. Data always has to be actively acquired somehow. Just as an example, I always tell people: look at the data. So let's look at some data. This is some Common Crawl data; I'm going to take 10 documents, and hopefully this works (I think the rendering is off). You can kind of see this is a sort of random sample of Common Crawl. And you can see that this is maybe not exactly the data you want. Oh, here's some actual real text, that's cool. But if you look at most of Common Crawl, aside from parts being in a different language, you can also see these are very spammy sites, and you'll quickly realize that a lot of the web is just trash. Okay, maybe that's not surprising, but it's more trash than you would expect, I promise.
So what I'm saying is that there's a lot of work that needs to happen for data. You can crawl the internet, you can take books, arXiv papers, GitHub, and there's actually a lot of processing that needs to happen. There are also legal questions about what data you can train on, which we'll touch on. Nowadays, a lot of frontier models have to actually buy data, because the publicly accessible data on the internet turns out to be a bit limited for really frontier performance. It's also important to remember that the data that's scraped is not actually text: it's HTML, or PDFs, or in the case of code it's just directories. So there has to be an explicit process that takes this data and turns it into text. We're going to talk about the transformation from HTML to text, and this is going to be a lossy process; the trick is how to preserve the content and some of the structure without just keeping raw HTML. Filtering, as you can surmise, is going to be very important, both for getting high-quality data and for removing harmful content; generally people train classifiers to do this. Deduplication is also an important step, which we'll talk about.
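As a preview of what classifier-based quality filtering can look like, here's a sketch using fastText, which is one common choice (the file name, labels, and threshold are made up for illustration):

```python
import fasttext

# train.txt: one document per line, prefixed with __label__good or __label__bad,
# e.g. lines sampled from Wikipedia (good) vs. random Common Crawl (bad).
model = fasttext.train_supervised(input="train.txt")

def keep(document: str, threshold: float = 0.9) -> bool:
    """Keep a document only if the classifier is confident it's high quality."""
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__good" and probs[0] >= threshold
```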
Okay. So assignment four is all about data. We're going to give you a raw Common Crawl dump so you can see just how bad it is. You're going to train classifiers and dedup, and then there's going to be a leaderboard where you try to minimize perplexity given your token budget. So now you have the data, you've built all your fancy kernels, and you can really train models. But at this point what you'll get is a model that can complete the next token; this is called a base model, and I think of it as a model that has a lot of raw potential, but it needs to be aligned or modified in some way, and alignment is the process of making it useful. So
alignment captures a lot of different things, but three in particular. First, you want to get the language model to follow instructions: completing the next token is not necessarily following the instruction, it'll just complete the instruction or whatever it thinks will follow the instruction. Second, this is where you specify the style of the generation: whether you want it long or short, whether you want bullets, whether you want it to be witty or have sass or not. When you play with, say, ChatGPT versus Grok, you'll see that different alignment has happened. And then also safety: one important thing is for these models to be able to refuse answers that could be harmful. That's where alignment also kicks in. There are generally two phases of alignment. There's supervised fine-tuning, and here the goal is very simple: you gather a set of user-assistant pairs, so prompt-response pairs, and then you do supervised learning. The idea is that the base model already has the raw potential, so fine-tuning it on a few examples is sufficient. Of course, the more examples you have, the better the results, but there are papers like this one showing that even a thousand examples suffice to give you instruction-following capabilities from a good base model. So this part is actually very simple, and it's not that different from pre-training, because you're just given text and you maximize the probability of the text.
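Here's a minimal sketch of that supervised fine-tuning loss (my own illustrative version, assuming a model that maps token IDs to next-token logits of shape [batch, seq, vocab]; one common convention, used here, is to mask out the prompt tokens so only the response is trained on):

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    tokens = torch.cat([prompt_ids, response_ids])
    logits = model(tokens[:-1].unsqueeze(0)).squeeze(0)  # predict token t+1 from prefix
    targets = tokens[1:]
    loss = F.cross_entropy(logits, targets, reduction="none")
    # Only positions whose target is a response token contribute to the loss.
    mask = torch.zeros_like(targets, dtype=torch.bool)
    mask[len(prompt_ids) - 1:] = True
    return loss[mask].mean()
```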
So the second part is a bit more interesting from an algorithmic perspective. The idea here is that even after the SFT phase, you will have a decent model. Now how do you improve it? You could get more SFT data, but that can be very expensive because someone has to sit down and annotate data. The goal of learning from feedback is to leverage lighter forms of annotation and have the algorithms do a bit more work. One type of data you can learn from is preference data. This is where you generate multiple responses from a model to a given prompt, like A and B, and the user rates whether A or B is better. So the data might look like: for "what's the best way to train a language model," the model generates "use a large data set" or "use a small data set," and of course the answer should be A. That is a unit of expressing preferences. Another type of supervision is using verifiers. For some domains, you're lucky enough to have a formal verifier, like for math or code. Or you can use learned verifiers, where you train an actual language model to rate the response. And of course this relates
to evaluation. Then there are the algorithms; here we're in the realm of reinforcement learning. One of the earliest algorithms applied to instruction-tuning models was PPO, proximal policy optimization. It turns out that if you just have preference data, there's a much simpler algorithm called DPO that works really well. But in general, if you want to learn from verifier data, it's not preference data, so you have to embrace RL fully. There's a method, which we'll cover in this class, called GRPO (Group Relative Policy Optimization), which simplifies PPO and makes it more efficient by removing the value function; it was developed by DeepSeek and seems to work pretty well.
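To give a flavor of why DPO is simple, here's a sketch of its loss (inputs are assumed to be summed log-probabilities of each response under the policy being trained and under a frozen reference model, typically the SFT model):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each response
    # relative to the reference model, scaled by beta.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the margin between the chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```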
Okay, so assignment five implements supervised fine-tuning, DPO, and GRPO, and of course evaluation.
A question about assignment one: do people have similar things to say about assignments two onward?
Yeah, the question is that assignment one seems a bit daunting; what about the other ones? I would say that assignments one and two are definitely the heaviest and hardest. Assignment three is a bit more of a breather, and assignments four and five, at least last year, were I would say a notch below assignments one and two. Although, I don't know, it depends; we haven't fully worked out the details for this year. Yeah, it does get better. Okay, so just to recap the
different pieces here. Remember, efficiency is the driving principle, and there are a bunch of different design decisions; if you view everything through the lens of efficiency, I think a lot of things make sense. And importantly, it's worth pointing out that we are currently in a compute-constrained regime, at least in this class and for most people, who are somewhat GPU-poor: we have a lot of data but we don't have that much compute, and so these design decisions reflect squeezing the most out of the hardware. For example, in data processing we're filtering fairly aggressively, because we don't want to waste precious compute on bad or irrelevant data. Tokenization: it would be nice to have a model over bytes, that's very elegant, but it's very compute-inefficient with today's model architectures, so we do tokenization as an efficiency gain. Model architecture: there are a lot of design decisions there that are essentially motivated by efficiency. Training: the fact that most of what we're doing is just a single epoch shows that we're clearly in a hurry; we just need to see more data, as opposed to spending a lot of time on any given data point. Scaling laws are completely about efficiency: we use less compute to figure out the hyperparameters. And alignment is maybe a little bit different, but the connection to efficiency is that if you can put resources into alignment, then you actually require smaller base models. Okay.
So there are sort of two paths. If your use case is fairly narrow, you can probably use a smaller model: align it or fine-tune it and you can do well. But if your use cases are very broad, then there might not be a substitute for training a big model. So that's today. Increasingly now, at least for frontier labs, they're becoming data-constrained, which is interesting because the design decisions will presumably change. Compute will always be important, but I think the design decisions will change. For example, taking one epoch over your data doesn't really make sense if you have more compute: why wouldn't you take more epochs, at least, or do something smarter? Or maybe there will be different architectures, for example, because the Transformer was really motivated by compute efficiency. So that's something to ponder: it's still about efficiency, but the design decisions reflect what regime you're in. Okay, so now I'm going to dive into the first unit. But before that, any questions?
Do you have a Slack?
The question is whether we have a Slack. We will have a Slack; we'll send out details after this class.
Yeah. Will students auditing the course also have access to the same materials?
The question is whether students auditing the class will have access. You'll have access to all the online materials and assignments, and we'll give you access to Canvas so you can watch the lecture videos.
Yeah. What's the grading of the assignments?
What's the grading of the assignments? Good question. There will be a set of unit tests that you will have to pass, so part of the grading is just: did you implement this correctly? There will also be parts of the grade for whether you implemented a model that achieves a certain level of loss or is efficient enough. In the assignment, every problem part has a number of points associated with it, so that gives you a fairly granular view of what the grading looks like.
Okay, let's jump into tokenization. Andrej Karpathy has a really nice video on tokenization, and in general he makes a lot of these videos on how you can build things from scratch, which actually inspired a lot of this class. So you should go check out some of his videos. Tokenization, as we talked about, is the process of taking raw text, which is generally represented as Unicode strings, and turning it into a sequence of integers, where each integer represents a token. So we need a procedure that encodes strings into tokens and decodes them back into strings. And the vocabulary size is just the number of values that a token can take on, the range of the integers. Okay, so just to give you an
example of how tokenizers work, let's play around with this really nice website which allows you to look at different tokenizers, and just type in something like "hello" or whatever. Maybe I'll do this. One thing it does is show you the list of integers; this is the output of the tokenizer. It also nicely maps out the decomposition of the original string into a bunch of segments. A few things to note. First of all, the space is part of a token: unlike classical NLP, where the space just kind of disappears, everything is accounted for; tokenization is meant to be a reversible operation. And by convention, for whatever reason, the space usually precedes the token. Also notice that "hello" is a completely different token than " hello" (with a space), which might make you a little bit squeamish, and it can cause problems, but that's just how it is.
Question: I was going to ask, is the space being leading instead of trailing intentional, or is it just an artifact of the BPE process?
So the question is whether the space coming before the token is intentional or not. In the BPE process, which I will talk about, you actually pre-tokenize and then you tokenize each part, and I think the pre-tokenizer does put the space at the front. So it is built into the algorithm. You could put it at the end, but I think it probably makes more sense to put it at the beginning. Although, actually, I guess it could go either way; that's my sense. Okay, so
then if you look at numbers, you see that the numbers are chopped up into different pieces. It's a little bit interesting that it's left to right; it's definitely not grouping by thousands or anything semantic. But anyway, I encourage you to play with it and get a sense of what these existing tokenizers look like. This is the tokenizer for GPT-4o, for example. So there are some observations
that we made. So if you look at the GPT-2 tokenizer, which we'll use as a reference. Okay, let me see if I can pull this up; let me know if this is getting too small in the back. You can take a string, and if you apply the GPT-2 tokenizer, you get your indices. So it maps strings to indices, and then you can decode to get back the string; this is just a sanity check to make sure it actually round-trips. Another thing that's interesting to look at is the compression ratio, which is the number of bytes divided by the number of tokens: how many bytes are represented by a token? The answer here is 1.6.
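If you want to reproduce this kind of check yourself, here's a sketch using the tiktoken library's GPT-2 encoding (the example string is arbitrary, so the exact ratio will differ):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Hello, world! Language models from scratch."
indices = enc.encode(text)
assert enc.decode(indices) == text                 # round-trips back to the string

compression_ratio = len(text.encode("utf-8")) / len(indices)
print(indices, compression_ratio)                  # roughly 1.5-2 bytes per token
```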
Okay, so every token represents 1.6 bytes of data. That's just the GPT-2 tokenizer that OpenAI trained. To motivate BPE, I want to go through a sequence of attempts. Suppose you wanted to do tokenization: what would be the simplest thing? The simplest thing is
probably character-based tokenization. A
unic code string is a sequence of unic
code characters and each character can
be converted into an integer in called a
code point. Okay, so a maps to 97. Um
the world emoji maps to
127,757 and you can see that it converts
back. Okay. So you can define a
tokenizer which simply
um you know maps
uh each character into a code
point.
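A minimal sketch of that character-based tokenizer, using Python's built-in ord and chr:

```python
def char_encode(text: str) -> list[int]:
    # Each Unicode character maps to its integer code point.
    return [ord(ch) for ch in text]

def char_decode(ids: list[int]) -> str:
    return "".join(chr(i) for i in ids)

ids = char_encode("a🌍")
print(ids)                        # [97, 127757]
assert char_decode(ids) == "a🌍"  # it converts back
```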
Okay, so what's one problem with this? Yes, the compression ratio is one. Well, actually it's not quite one, because a character is not a byte, but it's maybe not as good as you want. Another problem is that if you look at some code points, they're actually really large. You're basically allocating one slot in your vocabulary for every character uniformly, and some characters appear way more frequently than others, so this is not a very effective use of your budget. So the vocabulary size is huge, which is a big deal in itself, but the bigger problem is that some characters are rare, and that's an inefficient use of the vocab. The compression ratio here is 1.5, because it's the number of bytes per token and a character can be multiple bytes. Okay, so that was a very naive approach.
On the other hand, you can do byte-based tokenization. Unicode strings can be represented as sequences of bytes, because every string can just be converted into bytes. So "a" is already just one byte, but some characters take up as many as four bytes; this is using the UTF-8 encoding of Unicode. There are other encodings, but this is by far the most common one, and it's a variable-length encoding. So let's just convert everything into bytes and see what happens. If you do that, all the indices are between 0 and 255, because there are only 256 possible values for a byte by definition. So your vocabulary is very small, and while not all bytes are equally used, you don't have that many sparsity problems.
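A minimal sketch of byte-based encoding and decoding under UTF-8:

```python
def byte_encode(text: str) -> list[int]:
    # UTF-8 uses one to four bytes per character; every index is in [0, 255].
    return list(text.encode("utf-8"))

def byte_decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

print(byte_encode("a"))    # [97] -- a single byte
print(byte_encode("🌍"))   # [240, 159, 140, 141] -- four bytes
assert byte_decode(byte_encode("a🌍")) == "a🌍"
```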
But what's the problem with byte-based encoding? Long sequences, yeah. In some ways I really wish byte-level encoding would work; it's the most elegant thing. But you get long sequences: your compression ratio is one, one byte per token, and that's just terrible, because your sequences will be really long and attention is, naively, quadratic in the sequence length. For example, a tokenizer with a compression ratio of 1.6 bytes per token produces roughly 1.6 times fewer tokens than byte-level encoding, which naively saves about 1.6² ≈ 2.5 times the attention compute. So you're just going to have a bad time in terms of efficiency. Okay, so that wasn't really good either.
Now, the thing you might think about is that maybe we have to be adaptive here: we can't afford a character or a byte per token, but maybe some tokens can represent lots of bytes and some tokens can represent just a few. One way to do this is word-based tokenization, which was actually very classic in NLP. So here's a string, and you can just split it into a sequence of segments and call each of these a token; you can do this with a regular expression. Here's a different regular expression, the one GPT-2 uses to pre-tokenize, which also splits your string into a sequence of strings. Then what you do with each segment is assign it an integer, and you're done.
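As a rough illustration of word-based splitting with a simple regular expression (this is not GPT-2's actual pre-tokenization pattern, which is more elaborate and relies on the regex package):

```python
import re

text = "the cat in the hat."

# Naive word-based split: runs of word characters, or single punctuation marks.
segments = re.findall(r"\w+|[^\w\s]", text)
print(segments)   # ['the', 'cat', 'in', 'the', 'hat', '.']

# Assigning each distinct segment an integer gives the token IDs.
vocab = {seg: i for i, seg in enumerate(dict.fromkeys(segments))}
ids = [vocab[seg] for seg in segments]
print(ids)        # [0, 1, 2, 0, 3, 4]
```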
Okay, so what's the problem with this? Yeah, the problem is that your vocabulary size is sort of unbounded. Well, maybe not quite unbounded, but you don't know how big it is, because on a new input you might get a segment you've never seen before. And that's actually a big problem. Word-based tokenization is a real pain, because some real words are rare, new words have to be mapped to this UNK token, and if you're not careful about how you compute perplexity, you're just going to mess things up. So word-based tokenization captures the right intuition of adaptivity, but it's not exactly what we want here.
So here we're finally going to talk about BPE, or byte pair encoding. This is actually a very old algorithm, developed by Philip Gage in 1994 for data compression, and it was first introduced into NLP for neural machine translation. Before that, papers that did machine translation, and basically all of NLP, used word-based tokenization, and again, word-based was a pain. So that paper pioneered the idea that we can use this nice algorithm from '94 to make tokenization round-trip, and we don't have to deal with UNKs or any of that stuff. And then finally this entered the language modeling era through GPT-2, which was trained using a BPE tokenizer.
Okay, so the basic idea is that instead of defining some preconceived notion of how to split things up, we're going to train the tokenizer on raw text. That's the basic insight, if you will. Organically, common sequences that span multiple characters will be represented as one token, and rare sequences will be represented by multiple tokens. There's a slight detail, which is that for efficiency the GPT-2 paper uses a word-based tokenizer as a sort of pre-processing step to break the text into segments, and then runs BPE on each of the segments, which is what you're going to do in this class as well.
The BPE algorithm itself is actually very simple. We first convert the string into a sequence of bytes, which we already did when we talked about byte-based tokenization, and then we successively merge the most common pair of adjacent tokens, over and over again. The intuition is that if a pair of tokens shows up a lot, we compress it into one token; we dedicate vocabulary space to it.
Okay, so let's walk through what this algorithm looks like. We're going to use this cat-and-hat string as an example, and we'll convert it into a sequence of integers; these are the bytes. Then we're going to keep track of what we've merged. Remember, merges is a map from a pair of integers, which can represent bytes or pre-existing tokens, to the newly created token, and the vocab is just a handy way to map each index to the bytes it represents.
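To make those data structures concrete, here's a minimal sketch of the starting state (variable names are illustrative, and the example string just stands in for whatever text you train on):

```python
# Training text as a sequence of byte values (integers in [0, 255]).
indices = list("the cat in the hat".encode("utf-8"))

# merges: (token_id, token_id) -> new token_id, recorded in creation order.
merges: dict[tuple[int, int], int] = {}

# vocab: token_id -> the bytes that token stands for; starts with all 256 single bytes.
vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
```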
Okay, so now we run the BPE algorithm. It's very simple, so I'm just going to step through the code. We do this some number of times; in this case, three. First, we count up the number of occurrences of each pair of adjacent bytes (hopefully this doesn't get too small). We step through the sequence: what's (116, 104)? Increment that count. (104, 101)? Increment that count. We go through the whole sequence and count up the pairs of bytes. After we have these counts, we find the pair that occurs the most number of times. There are multiple ties here, but we'll break ties and say (116, 104), which occurred twice. Now we merge that pair. We create a new slot in our vocab, which is going to be 256; so far the vocab is 0 through 255, but now we're expanding it to include 256. And we say that every time we see 116 followed by 104, we replace it with 256. Then we apply that merge to our training sequence. After we do that, the (116, 104) pairs became 256, and remember, that pair occurred twice. Now we loop through the algorithm one more time. The second time, it decided to merge 256 and 101, and we replace that in the indices. Notice that the indices are shrinking: our compression ratio is getting better as we make room for more vocabulary items and have a larger vocabulary to represent everything. Let me do this one more time: the next merge is made, and the sequence shrinks once more. And now we're done.
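Putting that walkthrough together, here is a minimal (and deliberately unoptimized) sketch of BPE training; it skips pre-tokenization and special tokens, so it's a starting point rather than the assignment's solution:

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    indices = list(text.encode("utf-8"))              # start from raw bytes
    merges: dict[tuple[int, int], int] = {}           # (a, b) -> new token id
    vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}

    for step in range(num_merges):
        # 1. Count occurrences of each adjacent pair of tokens.
        counts = Counter(zip(indices, indices[1:]))
        if not counts:
            break
        # 2. Pick the most frequent pair (ties go to the pair seen first here).
        pair = max(counts, key=counts.get)
        # 3. Allocate a new vocabulary slot for the merged pair.
        new_id = 256 + step
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        # 4. Replace every occurrence of the pair in the training sequence.
        merged, i = [], 0
        while i < len(indices):
            if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(indices[i])
                i += 1
        indices = merged

    return merges, vocab
```

Running this on a string like "the cat in the hat" with three merges picks the pair for "th" first, since it occurs twice, matching the counts in the walkthrough.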
Okay, so let's try out this tokenizer. We have the string "the quick brown fox", we encode it into a sequence of indices, and then we use our BPE tokenizer to decode. Let's step through what that looks like. Actually, decoding isn't that interesting; I should have gone through encode, so let's go back to encode. For encode, you take a string, convert it to indices, and you just replay the merges, importantly in the order that they occurred. So I replay these merges, I get my indices, and then we verify that it round-trips.
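A minimal sketch of that encode/decode pair; note that it replays every learned merge in order, which is exactly the inefficiency mentioned next:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    indices = list(text.encode("utf-8"))
    # Python dicts preserve insertion order, so this replays merges in training order.
    for pair, new_id in merges.items():
        merged, i = [], 0
        while i < len(indices):
            if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(indices[i])
                i += 1
        indices = merged
    return indices

def decode(indices: list[int], vocab: dict[int, bytes]) -> str:
    # Look up each token's bytes and decode the concatenation as UTF-8.
    return b"".join(vocab[i] for i in indices).decode("utf-8")

merges, vocab = train_bpe("the cat in the hat", 3)   # from the sketch above
ids = encode("the quick brown fox", merges)
assert decode(ids, vocab) == "the quick brown fox"
```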
Okay, so that was pretty simple. But because it's simple, it's also very inefficient. For example, encode loops over all the merges; it should only loop over the merges that matter. And there are some other bells and whistles, like special tokens and pre-tokenization. So in your assignment, you're going to essentially take this as a starting point, or rather implement your own from scratch, and your goal is to make the implementation fast. You can parallelize it if you want; go have fun.
Okay, so to summarize tokenization: a tokenizer maps between strings and sequences of integers. We looked at character-based, byte-based, and word-based tokenization, which are all highly suboptimal for various reasons. BPE is a very old algorithm from 1994 that still proves to be an effective heuristic, and the important thing is that it looks at your corpus statistics to make sensible decisions about how to best adaptively allocate vocabulary to represent sequences of characters. I hope that one day I won't have to give this lecture, because we'll just have architectures that map directly from bytes, but until then, we'll have to deal with tokenization. Okay, so that's it for today. Next time we're going to dive into the details of PyTorch, give you the building blocks, and pay attention to resource accounting. All of you have presumably written PyTorch programs, but we're going to really look at where all the FLOPs are going. Okay, see you next time.