Tokenization and Byte Pair Encoding

By Serrano.Academy

Summary

Topics Covered

  • LLMs Process Tokens, Not Words
  • BPE Merges Frequent Pairs
  • BPE Captures Suffix Semantics

Full Transcript

Hello, I'm Luis Serrano and this is Serrano Academy. In this video, I'm going to tell you about tokenization and BPE, that is, byte pair encoding. Tokens are really the building blocks of LLMs, because LLMs don't work with words or with letters; they work with these units called tokens. This video is part of a long series of videos that I have about transformers, attention, positional encoding, all that, and you can find the links in the description of the video.

In these videos, I talk about the architecture of a transformer. It has several pieces: tokenization, embeddings, positional encodings, the meat of the transformer, which is the feed-forward neural network and attention, and finally a softmax layer at the very end in order to output the next word. So if your prompt is "Write a story", the output is the next word that comes after "Write a story".

So let's get to tokenization, which is the first piece of the architecture, and it works as follows. Let's say you have the sentence "Write a story". As I said, the model doesn't look at the word "write", the word "a", and the word "story"; it actually looks at them as tokens. For the most part, tokens are words: each word is a token. But sometimes you have more complicated words like "doesn't", which turns into two tokens, "does" and "n't". Now, not every tokenizer does this; some do it differently. But the idea is that you're breaking down the words into sensible pieces.

Now, there are two extremes. You could have the coarsest tokenizer, where every single word you put in is a token; that's way too many tokens for a model to handle. The other extreme is to make it as fine as possible, where every single character you input is a token. Then you don't have that many tokens in your vocabulary, but each token isn't adding much information, and even a short text ends up consisting of a lot of tokens. And since every token has to be processed every time through the entire neural network, this results in a very, very costly process. So what you want is something in the middle: something where most words are tokens, but complicated words can turn into two or three or even more tokens, in a way that is logical and sensible.

A demo that works pretty well is the OpenAI tokenizer; you can look at the link up here. As you can see, in this tokenizer every word is a token, periods are tokens, exclamation signs, all of that are tokens. But if you have a really long word like "antidisestablishmentarianism", well, that gets broken into prefixes and suffixes.
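
If you'd rather poke at this from code instead of the web demo, here is a minimal sketch using OpenAI's tiktoken library; the package, the "cl100k_base" encoding name, and the example strings are my choices, not something shown in the video:

```python
# pip install tiktoken
import tiktoken

# Load one of the BPE encodings that tiktoken ships with.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Write a story!", "antidisestablishmentarianism"]:
    token_ids = enc.encode(text)
    # Decode each id on its own to see the individual pieces.
    pieces = [enc.decode([t]) for t in token_ids]
    print(text, "->", pieces)
```

You should see the short words and the punctuation come back as single tokens, while the long word is broken into several sub-word pieces, just like in the demo.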

Now the question is, how do we build these tokens? Well, there are many useful ways to do it, but I'm going to tell you about one which is called byte pair encoding, and it's pretty logical. Let's say that our universe consists of four words: hug, hugs, bug, and bugs. We're going to tokenize that dictionary of words over there. Let's start in the finest possible way, which is every letter is a token. So I have five tokens: h, u, g, s, and b.

Now let's say that's too much and I want to save some work. Every time I have to type "hugs", I have to type four tokens, and I would like to do less work. So what's the smartest way I can do this? In other words, which two letters would you join in order to do the least amount of work? Well, let's break every word into its tokens: h-u-g, h-u-g-s, and so on. And now we're going to do the following: look at the most common pair of existing tokens appearing together, join them, and give the result a new keystroke. So what are the pairs? Well, HU appears twice, UG appears four times, GS appears twice, and BU appears twice.
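
Here is what that counting step looks like in Python. The corpus and the letter-level starting tokens are the ones from the video; the function name count_pairs is just something I made up for this sketch:

```python
from collections import Counter

# The toy corpus from the video, with each word split into letter-level tokens.
corpus = [list("hug"), list("hugs"), list("bug"), list("bugs")]

def count_pairs(tokenized_words):
    """Count how often each adjacent pair of tokens appears across the corpus."""
    pairs = Counter()
    for word in tokenized_words:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs

print(count_pairs(corpus))
# Counter({('u', 'g'): 4, ('h', 'u'): 2, ('g', 's'): 2, ('b', 'u'): 2})
```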

So it seems that if I take the U and the G and join them into one keystroke, then every time I run into UG, I have to press just one keystroke instead of two, and I save a lot of work. So that's going to be my first step: I'm going to join U and G into one token. Now I have one more token: I have H, U, G, S, B, and also UG, which counts as one token. So now what happens? Well, now I can type each of the words with just two or three keystrokes, and these are my new tokens.
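
The merge itself, replacing every U followed by G with a single UG token, could look like this; it continues from the count_pairs sketch above, and apply_merge is again an illustrative name:

```python
def apply_merge(tokenized_words, pair):
    """Replace every occurrence of the adjacent pair (a, b) with the merged token a+b."""
    merged = []
    for word in tokenized_words:
        new_word = []
        i = 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])  # e.g. "u" + "g" -> "ug"
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged.append(new_word)
    return merged

corpus = apply_merge(corpus, ("u", "g"))
print(corpus)
# [['h', 'ug'], ['h', 'ug', 's'], ['b', 'ug'], ['b', 'ug', 's']]
```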

Now, let's do it one more time: of the tokens in this list, which two would you join? Well, let's count again. H and UG appear together twice, UG and S appear together twice, B and UG appear together twice, and the rest appear zero times. So we could go for any of those three, and when we have ties, we just pick randomly. Let's pick H and UG, so now HUG is a new token. In our new list, HUG appears as a token, and look at what the words became: the first word is one token, the second word is two tokens, and the third and fourth words are still split.

Let's do it one last time. Of the existing tokens, which two appear together most often? Well, it turns out that B and UG appear together twice, and the rest of the pairs appear once or zero times. So we're going to join B and UG into BUG. And now we have a new list, which is the original five letters and then UG, HUG, and BUG. Let's say we stop right there. So now we have some pretty good tokens. Those are the tokens here in step four, and they split the words pretty well.
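
Putting the two sketches together, the whole toy training procedure is just a loop: count the pairs, merge the most frequent one, and repeat for as many merges as you want. Here is a minimal version that reuses count_pairs and apply_merge from above; note that ties are broken by whichever pair Counter.most_common lists first, which is deterministic rather than random, but in the same spirit as the video:

```python
def learn_bpe(words, num_merges):
    """Learn BPE merges on a list of words, starting from letter-level tokens."""
    tokenized = [list(w) for w in words]
    vocab = sorted({ch for w in words for ch in w})  # the initial letter tokens
    for _ in range(num_merges):
        pairs = count_pairs(tokenized)
        if not pairs:
            break
        best, _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        tokenized = apply_merge(tokenized, best)
        vocab.append(best[0] + best[1])         # record the new merged token
    return vocab, tokenized

vocab, tokenized = learn_bpe(["hug", "hugs", "bug", "bugs"], num_merges=3)
print(vocab)      # ['b', 'g', 'h', 's', 'u', 'ug', 'hug', 'bug']
print(tokenized)  # [['hug'], ['hug', 's'], ['bug'], ['bug', 's']]
```

After three merges the vocabulary is the five letters plus UG, HUG, and BUG, exactly the list from the walkthrough, and "hugs" and "bugs" keep that separate S token.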

Notice something interesting: that S over there kind of gives you information. It tells you whether you're talking about the singular or the plural. And that's something very interesting about BPE: it retains some information. For example, let's say I have the word "walked", the word "talked", and some unknown word, "bisked"; I have no idea what that means. When we put them through the tokenizer, notice what happens. "Walked" becomes "walk" and the suffix "ed". "Talked" becomes "talk" and the suffix "ed". And then "bisked" is something the model didn't recognize, so the tokenizer splits it into pretty much letters, or small tokens, but it still has that suffix "ed". So whatever it is, we know it's the past tense of something. What the tokenizer tells us is that the first word is the past of "walk", the second one is the past of "talk", and the third one is the past of whatever "bisk" is. We don't know what the word is, but we know it is very likely a past tense. If you try different past tenses, you may see that the suffix "ed" is not always by itself; sometimes it appears with an extra letter. This is because these tokenizers have been trained on a really, really big corpus, and maybe there those prefixes and suffixes mean something else.
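
You can try this yourself with the same kind of tiktoken snippet as before. The nonsense word below is just something I typed, and exactly how each word gets split depends on which encoding you load and on the corpus it was trained on:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["walked", "talked", "bisked"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(word, "->", pieces)

# Typically the real words come back whole or as a stem plus a short suffix,
# while the made-up word gets chopped into smaller pieces; whether an "ed"-style
# suffix survives at the end varies from tokenizer to tokenizer.
```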

So that's all for today. Thank you very much for your attention. I would like to thank my sponsors and the channel members who help me do what I love, which is teaching the world. I'd also like to thank the folks from the Deep Learning Indaba, in particular Annie, who were actually very helpful with this material, because it was originally created for a workshop at the Deep Learning Indaba conference in Rwanda.

If you like what you see, then definitely check out my page, serrano.academy. There I have a lot of courses on machine learning: classical machine learning, generative machine learning, LLMs, agents, you name it. They have videos, they have code and text, and they're all free, so definitely check it out. And finally, if you like this, please subscribe to my channel, Serrano Academy, for more notifications and to be updated about new videos. If you want to go further, you can also support me on Patreon or join the channel, and then you get cool perks: for example, early access to videos, monthly Q&As with me, or even your name on the video. But of course, my videos will always be free. You can also follow me on Twitter, Serrano Academy. And again, here's my page, serrano.academy, with all the courses. Thank you very much, and see you in the next video.
