Tokenization and Byte Pair Encoding
By Serrano.Academy
Summary
Topics Covered
- LLMs Process Tokens, Not Words
- BPE Merges Frequent Pairs
- BPE Captures Suffix Semantics
Full Transcript
Hello, I'm Luis Serrano and this is Serrano Academy. And in this video, I'm going to tell you about tokenization and BPE. That's byte pair encoding. Tokens are really the building blocks of LLMs, because LLMs don't work with words or with letters. They work with these units called tokens. This video is part of a long series of videos that I have about transformers, attention, positional encoding, all that. And you can find the links in the description of the video.
So, in these videos, I talk about the architecture of a transformer. This way, you have several pieces: tokenization, embeddings, positional encodings, the meat of the transformer which is the feed-forward neural network and attention, and finally a softmax layer at the very end in order to output the next word. So if your prompt is "write a story", the output is just the next word that comes after "write a story". So let's get to tokenization, which is the first piece of the architecture, and tokenization
works as follows. Let's say you have the sentence "write a story". As I said, the model doesn't look at the word "write", the word "a", and the word "story". It actually looks at them as tokens. And for the most part, tokens are words. Each word is a token. But sometimes you have more complicated words like "doesn't", which turns into two tokens: "does" and "n't".
Now, not every tokenizer does this. Some
do it differently. But the idea is that you're breaking down the words into sensible pieces. Now, there are two extremes, right? You could have the coarsest tokenizer, where every single word that you put in is a token. That's way too many different tokens for a model to process. And the other is to make it as fine as possible, which is every single character you input is a token. Then you don't have that many tokens in your library, but you're not adding much information. And then your text could be short, but it still consists of a lot of tokens. And since every token has to be processed every time through the entire neural network, this results in a very, very costly process. So what you want is something in the middle,
something where most words are tokens, but complicated words can turn into two or three or even more tokens, in a way that is logical and sensible.
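As a quick illustration of that trade-off, here is a toy comparison of the two extremes on one sentence (my own sketch, not something from the video):

```python
# A quick illustration of the two extremes described above: character-level
# tokens give a tiny vocabulary but long sequences; word-level tokens give
# short sequences but a vocabulary with one entry per distinct word.
text = "write a story about a dog that writes stories"

char_tokens = list(text)      # finest extreme: every character is a token
word_tokens = text.split()    # coarsest extreme: every word is a token

print("characters:", len(char_tokens), "tokens, vocabulary of", len(set(char_tokens)))
print("words:     ", len(word_tokens), "tokens, vocabulary of", len(set(word_tokens)))
```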
So a demo that works pretty well is the OpenAI tokenizer. You can look at the link upstairs. And as you can see, this is the tokenizer. Every word is a token, and periods are tokens, and exclamation signs, all of that are tokens. But if you have a really long word like antidisestablishmentarianism, well, that gets broken into suffixes and prefixes.
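If you would rather poke at this from code than from the web demo, here is a minimal sketch, assuming the `tiktoken` library and its `cl100k_base` encoding (both are my assumptions; the video only shows the web demo):

```python
# A minimal sketch of inspecting an OpenAI-style tokenizer from code,
# assuming `tiktoken` is installed (pip install tiktoken). The encoding
# name "cl100k_base" is one choice; the web demo may use a different one.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Write a story.", "antidisestablishmentarianism"]:
    ids = enc.encode(text)
    # Decode each token id back to its text piece to see how the split looks.
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8", errors="replace") for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")
```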
Now the question is: how do we build these tokens? Well, there are many useful ways to do it, but I'm going to tell you about one, which is called byte pair encoding, and it's pretty logical. So let's
say that our universe consists of four words: hug, hugs, bug, and bugs. So we're going to tokenize that dictionary of words over there. Let's start in the finest possible way, which is every letter is a token. So I have five tokens: H, U, G, S, and B. Now let's say that that's too much and I want to save some work. Like
every time I have to type hugs I have to type four tokens. I would like to do less work. So what's the smartest way I
can do this? In other words, which two letters would you join in order to do the least amount of work? Well, let's break every word into its tokens: H-U-G, H-U-G-S, and so on. And now what we're going to do is we're going to say the following: we look for the most common pair of adjacent existing tokens. We're going to join them and put a new keystroke there. So what are they? Well,
HU appears twice, UG appears four times, GS appears twice, and BU appears twice.
So it seems that if I take the U and the G and join them into one keystroke, then every time I run into UG, I have to press just one keystroke instead of two. And then I save a lot of work. So
that's going to be my first step. I'm
going to join UG into one token. So now
I have one more token. I have H, U, G, S, B, and UG. That UG is one token. So now what happens? Well, now I can type each word with just two or three keystrokes. And these
are my new tokens. Now, let's do it one more time. Let's see what other two symbols would you join, or what other two tokens in this list would you join? Well, let's count: H followed by UG appears twice, UG followed by S appears twice, B followed by UG appears twice, and the rest appear zero times. So, we could go for any of those three. When we have
ties, we just pick randomly. So, let's
pick HUG. And now HUG is a new token. So in our new list, HUG now appears as a new token, and look at how the words turned out. Now the first word is one token, the second word is two tokens, and the third and fourth words are still split. So let's do it one last time. Of the existing tokens, which two appear together more often? Well, it turns out that B and UG appear together twice, and the rest of the pairs appear once or zero times. So we're going to join B and UG into BUG.
And so now we have a new list, which is the original five letters and then UG, HUG, and BUG. And let's say we stop right there. So now we have some pretty good tokens. Those tokens are here in step four, and they split the words pretty well.
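Here is a minimal sketch of that merge loop in Python, applied to the toy corpus hug, hugs, bug, bugs. The function names and the tie-breaking rule (the first pair counted wins, which happens to reproduce the HUG choice from the video) are my own choices, not something specified in the video:

```python
# A minimal sketch of the BPE merge loop on the toy corpus {hug, hugs, bug, bugs}.
from collections import Counter

corpus = ["hug", "hugs", "bug", "bugs"]
# Start at the finest level: every word is a list of single-letter tokens.
words = [list(w) for w in corpus]

def most_common_pair(words):
    """Count adjacent token pairs across all words and return the most frequent one."""
    pairs = Counter()
    for tokens in words:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair (a, b) with the merged token a+b."""
    a, b = pair
    merged = []
    for tokens in words:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append(out)
    return merged

# Three merge steps, as in the video: UG, then HUG (a tie-break), then BUG.
for step in range(3):
    pair = most_common_pair(words)
    words = merge_pair(words, pair)
    print(f"step {step + 1}: merged {pair} -> {[' '.join(t) for t in words]}")
```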
Notice something interesting, and it's that that S over there kind of gives you information. It tells you if you're talking about the singular or the plural. And that's something very interesting about BPE, which is that it retains some information. For example, let's say I have the word walked, the word talked, and some unknown word, bisqued. I
have no idea what that means. And when
we put them through the tokenizer, then notice what happens. Walked becomes walk and the suffix ed. Talked becomes talk and the suffix ed. And then bisqued becomes something the model didn't recognize. So it split it into pretty much letters or small tokens, but it still has that suffix ed. So whatever it is, we know it's the past tense of something. So what the tokenizer tells us is that the first word is the past of walk, the second one is the past of talk, and the third one is the past of whatever bisque is. So we don't know what the word is, but we know that this is very likely to be its past tense.
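To see that fallback behavior concretely, here is a small sketch that replays the learned merges from the toy example on new words. The word "hugged" is my own stand-in for an unseen word like bisqued, and this is the toy tokenizer from above, not the OpenAI one:

```python
# A sketch of the "apply" side of BPE: once merges have been learned, a new
# word is encoded by replaying the merges in the order they were learned.
learned_merges = [("u", "g"), ("h", "ug"), ("b", "ug")]  # from the toy example

def encode(word, merges):
    """Split a word into letters, then replay each learned merge in order."""
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

for word in ["hug", "hugs", "hugged"]:
    print(word, "->", encode(word, learned_merges))
# hug    -> ['hug']
# hugs   -> ['hug', 's']
# hugged -> ['hug', 'g', 'e', 'd']   # unseen word: known piece + leftover letters
```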
If you try different past tenses, you may see that the suffix ed is not always by itself. Sometimes it appears with an extra letter. This is because these tokenizers may have been trained on a really, really big corpus, and maybe those prefixes and suffixes mean something else there. So that's all for today. Thank you
very much for your attention. I would
like to thank my sponsors and the channel members who help me do what I love, which is teaching the world. And also I'd like to thank the folks from the Deep Learning Indaba, in particular Annie, who were actually very helpful with this material, because it was originally created for a workshop at the Deep Learning Indaba conference in Rwanda. If you
like what you see, then definitely check out my page, serrano.academy. In there, I have a lot of courses on machine learning: classical machine learning, generative machine learning, LLMs,
agents, you name it. And they have videos, they have code and text. They're
all free, so definitely check it out.
And finally, if you like this, please subscribe to my channel, Serrano.Academy, for more notifications and to be updated on new videos. If you want to go further, you can also support me on Patreon or join the channel, and then you get cool perks. For example, you can get early access to videos, monthly Q&As with me, or even your name on the video.
But of course, my videos will always be free. You can also follow me on Twitter, SerranoAcademy. And again, here's my page, serrano.academy, with courses. Thank you very much and see you in the next video.