Let's build the GPT Tokenizer
By Andrej Karpathy
Summary
## Key takeaways
- **Tokenization is the root of many LLM quirks**: Many issues in LLMs, such as poor spelling or difficulty with non-English languages, can often be traced back to the tokenization process, not just the model architecture. [19:00], [05:00:00]
- **Byte Pair Encoding (BPE) compresses text into tokens**: BPE iteratively merges the most frequent pairs of bytes in a text corpus to create a vocabulary of tokens, effectively compressing the data for LLMs. [02:43:00], [24:20:00]
- **GPT-4 tokenizer is more efficient than GPT-2**: The GPT-4 tokenizer uses a larger vocabulary and improved regex patterns, significantly reducing token counts for the same text compared to GPT-2, especially for code. [12:44:00], [14:05:00]
- **Special tokens are crucial for LLM structure**: Special tokens like 'end of text' or conversation delimiters (e.g., 'im start') are added to the vocabulary and require model surgery to integrate, enabling structured data processing. [57:00:00], [01:18:38]
- **Untrained tokens cause unpredictable LLM behavior**: Tokens that appear in the tokenizer's training data but not in the LLM's training data can lead to undefined behavior, as seen with 'solid gold Magikarp', because their embeddings are never trained. [04:51:00], [01:53:57]
Topics Covered
- Tokenization causes many of your LLM's biggest problems.
- Why non-English languages are less efficient for LLMs.
- GPT-4's coding improved by fixing its tokenizer.
- How Byte Pair Encoding iteratively compresses text sequences.
- An LLM glitch reveals a deep tokenization vulnerability.
Full Transcript
hi everyone so in this video I'd like us
to cover the process of tokenization in
large language models now you see here
that I have a sad face and that's
because uh tokenization is my least
favorite part of working with large
language models but unfortunately it is
necessary to understand in some detail
because it is fairly hairy, gnarly, and
there's a lot of hidden foot guns to be
aware of and a lot of oddness with large
language models typically traces back to
tokenization so what is
tokenization now in my previous video
Let's Build GPT from scratch uh we
actually already did tokenization but we
did a very naive simple version of
tokenization so when you go to the
Google colab for that video uh you see
here that we loaded our training set and
our training set was this uh Shakespeare
uh data set now in the beginning the
Shakespeare data set is just a large
string in Python it's just text and so
the question is how do we plug text into
large language models and in this case
here we created a vocabulary of 65
possible characters that we saw occur in
this string these were the possible
characters and we saw that there are 65
of them and then we created a a lookup
table for converting from every possible
character a little string piece into a
token an
integer so here for example we tokenized
the string 'hi there' and we received
this sequence of
tokens and here we took the first 1,000
characters of our data set and we
encoded it into tokens and because it is
this is character level we received
1,000 tokens in a sequence so token 18
47
Etc now later we saw that the way we
plug these tokens into the language
model is by using an embedding
table and so basically if we have 65
possible tokens then this embedding
table is going to have 65 rows and
roughly speaking we're taking the
integer associated with every single
single token, we're using that as a
lookup into this table and we're
plucking out the corresponding row and
this row is a uh is trainable parameters
that we're going to train using back
propagation and this is the vector that
then feeds into the Transformer um and
that's how the Transformer sort of perceives every single token.
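For reference, here is a minimal sketch of that character-level scheme (the variable names are illustrative, not taken from the notebook):

```python
# A minimal sketch of character-level tokenization (names are illustrative).
text = "hi there"
chars = sorted(set(text))                     # the vocabulary of possible characters
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer lookup table
itos = {i: ch for ch, i in stoi.items()}      # integer -> character lookup table

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

tokens = encode("hi there")
print(tokens)           # a list of small integers, one per character
print(decode(tokens))   # "hi there"

# each integer then indexes a row of a trainable embedding table, e.g. in PyTorch:
#   emb = torch.nn.Embedding(vocab_size, n_embd)
#   x = emb(torch.tensor(tokens))   # one (n_embd,) vector per token, fed to the Transformer
```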
So here we had a very naive tokenization process that was a
character level tokenizer but in
practice, in state-of-the-art language
models people use a lot more complicated
schemes unfortunately
uh for constructing these uh token
vocabularies so we're not dealing on the
Character level we're dealing on chunk
level and the way these um character
chunks are constructed is using
algorithms such as, for example, the byte pair encoding algorithm, which we're
going to go into in detail um and cover
in this video I'd like to briefly show
you the paper that introduced byte-level encoding as a mechanism for
tokenization in the context of large
language models and I would say that
that's probably the gpt2 paper and if
you scroll down here to the section
input representation this is where they
cover tokenization the kinds of
properties that you'd like the
tokenization to have and they conclude
here that they're going to have a
tokenizer where you have a vocabulary of
50,257 possible
tokens and the context size is going to
be 1,024 tokens. So in the
attention layer of the Transformer
neural network
every single token is attending to the
previous tokens in the sequence and it's
going to see up to 1,024 tokens. So tokens
are this like fundamental unit um the
atom of uh large language models if you
will and everything is in units of
tokens everything is about tokens and
tokenization is the process for
translating strings or text into
sequences of tokens and uh vice versa
when you go into the Llama 2 paper as
well I can show you that when you search
token, you're going to get 63 hits,
and that's because tokens are again
pervasive so here they mentioned that
they trained on two trillion tokens of
data and so
on so we're going to build our own
tokenizer. Luckily, the byte pair encoding
algorithm is not uh that super
complicated and we can build it from
scratch ourselves and we'll see exactly
how this works before we dive into code
I'd like to give you a brief Taste of
some of the complexities that come from
the tokenization because I just want to
make sure that we motivate it
sufficiently for why we are doing all
this and why this is so gross so
tokenization is at the heart of a lot of
weirdness in large language models and I
would advise that you do not brush it
off a lot of the issues that may look
like just issues with the neural network
architecture or the large language model
itself are actually issues with the
tokenization and fundamentally Trace uh
back to it so if you've noticed any
issues with large language models, you know, not being able to do spelling tasks very easily, that's usually due to
tokenization simple string processing
can be difficult for the large language
model to perform
natively uh non-english languages can
work much worse and to a large extent
this is due to
tokenization sometimes llms are bad at
simple arithmetic; that can also be traced to tokenization. GPT-2 specifically would
have had quite a bit more issues with
python than uh future versions of it due
to tokenization there's a lot of other
issues maybe you've seen weird warnings
about a trailing whitespace, this is a
tokenization issue um
if you had asked GPT earlier about solid
gold Magikarp and what it is you would
see the llm go totally crazy and it
would start going off about a completely
unrelated tangent topic maybe you've
been told to use YAML over JSON in structured data; all of that has to do
with tokenization so basically
tokenization is at the heart of many
issues. I will loop back around to these
at the end of the video but for now let
me just um skip over it a little bit and
let's go to this web app, the Tiktokenizer vercel app. So I have it loaded
here and what I like about this web app
is that tokenization is running a sort
of live in your browser in JavaScript so
you can just type here stuff hello world
and the whole string
retokenizes. So here, what we see on the
left is a string that you put in on the
right we're currently using the gpt2
tokenizer we see that this string that I
pasted here is currently tokenizing into
300 tokens and here they are sort of uh
shown explicitly in different colors for
every single token so for example uh
this word tokenization became two tokens
the token
3,642 and
1,634. The token ' is' is token 318,
so be careful on the bottom you can show
white space and keep in mind that there
are spaces and '\n' newline
characters in here but you can hide them
for
clarity. The token ' at' is token 379, the token ' the' is 262, etc. So
you notice here that the space is part
of that uh token
chunk now so this is kind of like how
our English sentence broke up and that
seems all well and good now now here I
put in some arithmetic so we see that uh
the token '127', then '+', and then the token ' 6' followed by '77'. So what's
happening here is that 127 is feeding in
as a single token into the large
language model but the um number 677
will actually feed in as two separate
tokens and so the large language model
has to sort of um take account of that
and process it correctly in its Network
and see here 804 will be broken up into
two tokens and it's is all completely
arbitrary and here I have another
example of four-digit numbers and they
break up in a way that they break up and
it's totally arbitrary sometimes you
have um multiple digits single token
sometimes you have individual digits as
many tokens and it's all kind of pretty
arbitrary and coming out of the
tokenizer here's another example we have
the string egg and you see here that
this became two
tokens but for some reason when I say I
have an egg, you see, when it's ' egg' with a space, it's, sorry, it's a single token. So just 'egg' by itself at the beginning of a sentence is two tokens, but here as ' egg' it is suddenly a single token, for the exact
same string okay here lowercase egg
turns out to be a single token and in
particular notice that the color is
different so this is a different token
so this is case sensitive and of course
a capital egg would also be different
tokens and again um this would be two
tokens arbitrarily so so for the same
concept egg depending on if it's in the
beginning of a sentence at the end of a
sentence lowercase uppercase or mixed
all this will be uh basically very
different tokens and different IDs and
the language model has to learn from raw
data from all the internet text that
it's going to be training on that these
are actually all the exact same concept
and it has to sort of group them in the
parameters of the neural network and
understand just based on the data
patterns that these are all very similar
but maybe not almost exactly similar but
but very very similar
Now, after the egg demonstration, here I have an introduction from OpenAI's ChatGPT in Korean, 'annyeonghaseyo', etc.
so this is in Korean and the reason I
put this here is because you'll notice
that um non-english languages work
slightly worse in ChatGPT. Part of this is because, of course, the training data set for ChatGPT is much larger for English
and for everything else but the same is
true not just for the large language
model itself but also for the tokenizer
so when we train the tokenizer we're
going to see that there's a training set
as well and there's a lot more English
than non-english and what ends up
happening is that we're going to have a
lot more longer tokens for
English so how do I put this if you have
a single sentence in English and you
tokenize it you might see that it's 10
tokens or something like that but if you
translate that sentence into say Korean
or Japanese or something else you'll
typically see that the number of tokens
used is much larger and that's because
the chunks here are a lot more broken up
so we're using a lot more tokens for the
exact same thing and what this does is
it bloats up the sequence length of all
the documents so you're using up more
tokens and then in the attention of the
Transformer when these tokens try to
attend each other you are running out of
context um in the maximum context length
of that Transformer and so basically all
the non-english text is stretched out
from the perspective of the Transformer
and this just has to do with the training set used for the tokenizer
and the tokenization itself so it will
create a lot bigger tokens and a lot
larger groups in English and it will
have a lot of little boundaries for all
the other non-english text um so if we
translated this into English it would be
significantly fewer
tokens the final example I have here is
a little snippet of Python for doing FizzBuzz, and what I'd like you to notice is:
look all these individual spaces are all
separate tokens they are token
220. So, 220, 220, 220, 220, and then ' if' is a single token. And so what's
on here is that when the Transformer is
going to consume or try to uh create
this text it needs to um handle all
these spaces individually they all feed
in one by one into the entire
Transformer in the sequence and so this
is being extremely wasteful tokenizing
it in this way and so as a result of
that gpt2 is not very good with python
and it's not anything to do with coding
or the language model itself it's just
that if you use a lot of indentation
using space in Python like we usually do
uh you just end up bloating out all the
text and it's separated across way too
much of the sequence and we are running
out of the context length in the
sequence uh that's roughly speaking
what's what's happening we're being way
too wasteful we're taking up way too
much token space now we can also scroll
up here and we can change the tokenizer
so note here that gpt2 tokenizer creates
a token count of 300 for this string
here. We can change it to cl100k_base, which is the GPT-4 tokenizer, and we
see that the token count drops to 185 so
for the exact same string we are now
roughly halving the number of tokens, and
roughly speaking this is because uh the
number of tokens in the GPT-4 tokenizer is roughly double that of the number of tokens in the GPT-2 tokenizer, so we went from roughly 50K to roughly 100K.
now you can imagine that this is a good
thing because the same text is now
squished into half as many tokens so uh
this is a lot denser input to the
Transformer and in the Transformer every
single token has a finite number of
tokens before it that it's going to pay
attention to and so what this is doing
is we're roughly able to see twice as
much text as a context for what token to
predict next uh because of this change
but of course just increasing the number
of tokens is uh not strictly better
infinitely uh because as you increase
the number of tokens now your embedding
table is um sort of getting a lot larger
and also at the output we are trying to
predict the next token and there's the
soft Max there and that grows as well
we're going to go into more detail later
on this but there's some kind of a Sweet
Spot somewhere where you have a just
right number of tokens in your
vocabulary where everything is
appropriately dense and still fairly
efficient now one thing I would like you
to note specifically for the GPT-4
tokenizer is that the handling of the
white space for python has improved a
lot. You see that here these four spaces are represented as one single token, same for the three spaces here and then the token ' if', and here seven spaces were all
grouped into a single token so we're
being a lot more efficient in how we
represent Python and this was a
deliberate choice made by OpenAI when they designed the GPT-4 tokenizer, and they group a lot more whitespace into a single token. What this does is this
densifies Python and therefore we can
attend to more code before it when we're
trying to predict the next token in the
sequence and so the Improvement in the
Python coding ability from GPT-2 to GPT-4
is not just a matter of the language
model and the architecture and the
details of the optimization but a lot of
the Improvement here is also coming from
the design of the tokenizer and how it
groups characters into tokens okay so
let's now start writing some code
so remember what we want to do we want
to take strings and feed them into
language models for that we need to
somehow tokenize strings into some
integers in some fixed vocabulary and
then we will use those integers to make
a look up into a lookup table of vectors
and feed those vectors into the
Transformer as an input now the reason
this gets a little bit tricky of course
is that we don't just want to support
the simple English alphabet we want to
support different kinds of languages so
this is 'annyeonghaseyo' in Korean, which is hello,
and we also want to support many kinds
of special characters that we might find
on the internet for example
Emoji so how do we feed this text into
uh
Transformers? Well, what is this
text anyway in Python so if you go to
the documentation of a string in Python
you can see that strings are immutable
sequences of Unicode code
points okay what are Unicode code points
we can go to the Wikipedia page: Unicode code points
are defined by the Unicode Consortium as
part of the Unicode standard and what
this is really is that it's just a
definition of roughly 150,000 characters
right now and roughly speaking what they
look like and what integers um represent
those characters so it says 150,000
characters across 161 scripts as of
right now so if you scroll down here you
can see that the standard is very much
alive the latest standard 15.1 in
September
2023 and basically this is just a way to
define lots of types of
characters like for example all these
characters across different scripts so
the way we can access the Unicode code point, given a single character, is by using the ord function in Python. So for example I can pass in ord('h') and I can see that for the single character 'h' the Unicode code point is 104. Okay, but this can be arbitrarily
complicated so we can take for example
our Emoji here and we can see that the
code point for this one is
128,000. Or we can take the Korean character here, and this is 50,000. Now keep in mind,
you can't plug in strings here, because a string doesn't have a single code point; it only takes a single Unicode code point character and tells you its
integer. So in this way we can look up all the characters of this specific string and their code points, ord(x) for x in this string, and we get this encoding here.
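As a quick illustration of these lookups (the specific characters are just examples):

```python
# Inspecting Unicode code points with ord(), and going back with chr().
print(ord("h"))     # 104
print(ord("안"))    # 50504 -- a Korean syllable
print(ord("😀"))     # 128512 -- emoji code points are large

s = "안녕하세요 👋 (hello in Korean!)"
print([ord(x) for x in s])      # the raw code points of every character in the string

print(chr(104), chr(50504))     # 'h' '안' -- chr is the inverse of ord
```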
Now, notice that these raw code points are already integers, so why can't we simply use these integers and not have any tokenization at all? Why can't we just use this natively, as is, and just use the code points? Well, one reason for
that of course is that the vocabulary in
that case would be quite long so in this
case for Unicode, this is a
vocabulary of
150,000 different code points but more
worryingly than that I think the Unicode
standard is very much alive and it keeps
changing and so it's not kind of a
stable representation necessarily that
we may want to use directly so for those
reasons we need something a bit better
so to find something better we turn to
encodings so if we go to the Wikipedia
page here we see that the Unicode
Consortium defines three types of encodings: UTF-8, UTF-16 and UTF-32. These
encodings are the ways by which we can take Unicode text and translate it into binary data, or byte streams. UTF-8 is by far
the most common uh so this is the utf8
page now this Wikipedia page is actually
quite long but what's important for our
purposes is that UTF-8 takes every single code point and translates it into a byte stream, and this byte stream is between one
to four bytes so it's a variable length
encoding so depending on the Unicode
Point according to the schema you're
going to end up with between 1 to four
bytes for each code point. On top of that, there's UTF-16 and UTF-32; UTF-32 is nice because
it is fixed length instead of variable
length but it has many other downsides
as well so the full kind of spectrum of
pros and cons of all these different
three encodings are beyond the scope of
this video I just like to point out that
I enjoyed this blog post, and this blog
post at the end of it also has a number
of references that can be quite useful
One of them is the UTF-8 Everywhere manifesto, and this manifesto
describes the reason why utf8 is
significantly preferred and a lot nicer
than the other encodings and why it is
used a lot more prominently um on the
internet one of the major advantages
just just to give you a sense is that
utf8 is the only one of these that is
backwards compatible to the much simpler
ASCII encoding of text. But I'm not
going to go into the full detail in this
video so suffice to say that we like the
utf8 encoding and uh let's try to take
the string and see what we get if we
encoded into
utf8 the string class in Python actually
has do encode and you can give it the
encoding, which is, say, utf-8. Now, what we get out of this is not very nice, because this is a bytes object and it's not very nice in the way that it's printed, so I personally like to take it through list(), because then we actually get the raw bytes of this encoding. So these are the raw bytes that represent this string according to the UTF-8 encoding. We can
also look at utf16 we get a slightly
different byte stream, and here we start to see one of the disadvantages of UTF-16: you see how we have 0, something, 0, something, 0, something; we're starting to get a sense that this is a bit of a wasteful encoding, and indeed for simple ASCII characters or English characters here, we just have the structure of 0, something, 0, something, and it's not exactly nice. Same for UTF-32: when we
expand this we can start to get a sense
of the wastefulness of this encoding for
our purposes: you see a lot of zeros followed by something, and so this is not desirable.
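A small sketch of these encode calls, with an illustrative string (the exact byte values depend on the string you pick):

```python
# Encoding a string into the three Unicode encodings and looking at the raw bytes.
s = "안녕하세요 👋 (hello in Korean!)"

print(list(s.encode("utf-8")))    # variable length: 1 to 4 bytes per code point
print(list(s.encode("utf-16")))   # lots of interleaved zero bytes for the ASCII part
print(list(s.encode("utf-32")))   # even more zeros: 4 bytes per code point

# UTF-8 is backwards compatible with ASCII:
print(list("hello".encode("utf-8")))   # [104, 101, 108, 108, 111]
```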
So suffice it to say that we would like to stick with UTF-8 for our
purposes however if we just use utf8
naively, these are byte streams, so that would imply a vocabulary size of only 256 possible tokens. But this
vocabulary size is very very small what
this is going to do if we just were to
use it naively is that all of our text
would be stretched out over very very
long sequences of bytes and so
What this does is that certainly the embedding table is going to be tiny,
and the prediction at the top at the
final layer is going to be very tiny but
our sequences are very long and remember
that we have pretty finite um context
length and the attention that we can
support in a transformer for
computational reasons and so we only
have as much context length but now we
have very very long sequences and this
is just inefficient and it's not going
to allow us to attend to sufficiently
long text uh before us for the purposes
of the next token prediction task so we
don't want to use the raw bytes of the
utf8 encoding we want to be able to
support a larger vocabulary size that we can tune as a hyperparameter, but we want to stick with the UTF-8
encoding of these strings so what do we
do well the answer of course is we turn
to the byte pair encoding algorithm, which will allow us to compress these byte sequences to a variable amount.
so we'll get to that in a bit but I just
want to briefly speak to the fact that I
would love nothing more than to be able to feed raw byte sequences into language models. In fact, there's a paper about how this could potentially be done, from the summer of last year. Now the
problem is you actually have to go in
and you have to modify the Transformer
architecture because as I mentioned
you're going to have a problem where the
attention will start to become extremely
expensive because the sequences are so
long and so in this paper they propose
kind of a hierarchical structuring of
the Transformer that could allow you to
just feed in raw bites and so at the end
they say together these results
establish the viability of tokenization
free autoregressive sequence modeling
at scale so tokenization free would
indeed be amazing: we would just feed byte streams directly into our models, but
unfortunately I don't know that this has
really been proven out yet by
sufficiently many groups at a sufficient scale, but something like
this at one point would be amazing and I
hope someone comes up with it but for
now we have to come back and we can't
feed this directly into language models
and we have to compress it using the byte pair encoding algorithm. So let's see how that works. So as I mentioned, the byte pair encoding algorithm is not all that
complicated and the Wikipedia page is
actually quite instructive as far as the
basic idea goes. What we're doing is:
we have some kind of a input sequence uh
like for example here we have only four
elements in our vocabulary a b c and d
and we have a sequence of them so
instead of bytes, let's say we just have a vocab size of four. The sequence is too long and we'd
like to compress it so what we do is
that we iteratively find the pair of uh
tokens that occur the most
frequently and then once we've
identified that pair, we replace
that pair with just a single new token
that we append to our vocabulary so for
example here, the byte pair AA occurs
most often so we mint a new token let's
call it capital Z and we replace every
single occurrence of AA by Z so now we
have two Z's here so here we took a
sequence of 11 characters with
vocabulary size four and we've converted
it to a um sequence of only nine tokens
but now with a vocabulary of five
because we have a fifth vocabulary
element that we just created and it's Z
standing for the concatenation of AA, and we
can again repeat this process so we
again look at the sequence and identify
the pair of tokens that are most
frequent let's say that that is now AB
well, we are going to replace AB with a new token that we mint, call it Y. So Y
becomes ab and then every single
occurrence of ab is now replaced with y
so we end up with this so now we only
have 1 2 3 4 5 6 seven characters in our
sequence but we have not just um four
vocabulary elements or five but now we
have six and for the final round we
again look through the sequence find
that the phrase zy or the pair zy is
most common and replace it one more time
with another character, let's say X. So X is ZY, and we replace all occurrences of ZY,
and we get this following sequence so
basically after we have gone through
this process instead of having a um
sequence of
11 uh tokens with a vocabulary length of
four we now have a sequence of 1 2 3
four five tokens but our vocabulary
length is now seven. And so, in this way, we can iteratively compress our sequence as we mint new tokens.
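A tiny sketch of this toy example, using plain string replacement just to illustrate (real BPE operates on lists of token ids, not characters):

```python
seq = "aaabdaaabac"              # vocab {a, b, c, d}, sequence of length 11

seq = seq.replace("aa", "Z")     # most frequent pair "aa" -> new token Z
print(seq)                       # ZabdZabac  (9 tokens, vocab of 5)

seq = seq.replace("ab", "Y")     # next most frequent pair "ab" -> Y
print(seq)                       # ZYdZYac    (7 tokens, vocab of 6)

seq = seq.replace("ZY", "X")     # and "ZY" -> X
print(seq)                       # XdXac      (5 tokens, vocab of 7)
```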
So in the exact same way, we start out with byte sequences, so we have a 256
vocabulary size but we're now going to
go through these and find the byte pairs
that occur the most and we're going to
iteratively start minting new tokens
appending them to our vocabulary and
replacing things and in this way we're
going to end up with a compressed
training data set and also an algorithm
for taking any arbitrary sequence and
encoding it using this vocabulary, and also decoding it back to strings. So
let's now Implement all that so here's
what I did: I went to this blog post
that I enjoyed and I took the first
paragraph and I copy pasted it here into
text so this is one very long line
here now to get the tokens as I
mentioned we just take our text and we
encode it into utf8 the tokens here at
this point will be raw bytes, a single
stream of bytes and just so that it's
easier to work with instead of just a
bytes object I'm going to convert all
those bytes to integers and then create
a list of it just so it's easier for us
to manipulate and work with in Python
and visualize and here I'm printing all
of that so this is the original um this
is the original paragraph and its length
is
533 uh code points and then here are the
bytes encoded in UTF-8, and we see that
this has a length of 616 bytes at this
point or 616 tokens and the reason this
is more is because a lot of these simple
ASCII characters or simple characters,
they just become a single bite but a lot
of these Unicode more complex characters
become multiple bytes up to four and so
we are expanding that
size so now what we'd like to do as a
first step of the algorithm is we'd like
to iterate over here and find the pair
of bites that occur most frequently
because we're then going to merge it so
if you are working along in a notebook on the side, then I encourage you to basically
click on the link find this notebook and
try to write that function yourself
otherwise I'm going to come here and
Implement first the function that finds
the most common pair okay so here's what
I came up with there are many different
ways to implement this but I'm calling
the function get stats it expects a list
of integers I'm using a dictionary to
keep track of basically the counts and
then this is a pythonic way to iterate
consecutive elements of this list uh
which we covered in the previous video
and then here I'm just incrementing by one for all the pairs.
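Roughly, the function described here looks like this (a sketch along the lines of the notebook, not an exact copy):

```python
def get_stats(ids):
    """Count how often each consecutive pair of integers occurs in the list."""
    counts = {}
    for pair in zip(ids, ids[1:]):          # iterate consecutive elements, e.g. (101, 32)
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)
# print in a slightly nicer way: sort by count, descending
print(sorted(((v, k) for k, v in stats.items()), reverse=True))
```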
So if I call this on all the tokens here, then the stats comes out
here so this is the dictionary the keys
are these tuples of consecutive
elements and this is the count so just
to uh print it in a slightly better way
this is one way that I like to do that
where, it's a little bit convoluted here so you can pause if you like, but we iterate all the items; .items() called on a dictionary returns pairs of (key, value), and instead I create a list here of (value, key), because if it's a (value, key) list then I can call sort on it, and by default Python will use the first element, which in this case will be the value, to sort by if it's given tuples,
then reverse so it's descending and
print that so basically it looks like
101 comma 32 was the most commonly
occurring consecutive pair and it
occurred 20 times we can double check
that that makes reasonable sense so if I
just search for (101, 32), then you see that these are the 20
occurrences of that um pair and if we'd
like to take a look at what exactly that
pair is, we can use chr, which is the opposite of ord in Python. So we give it a Unicode code point: chr(101) and chr(32), and we see that this is 'e' and space. So basically there's a lot of 'e ' here, meaning that a lot of these words seem to end with 'e'. So here's 'e ' as an
example so there's a lot of that going
on here and this is the most common pair
so now that we've identified the most
common pair we would like to iterate
over this sequence we're going to Mint a
new token with the ID of
256 right because these tokens currently
go from zero to 255, so when we create a new
token it will have an ID of
256 and we're going to iterate over this
entire um list and every every time we
see 101 comma 32 we're going to swap
that out for
256 so let's Implement that now and feel
free to uh do that yourself as well so
first I commented uh this just so we
don't pollute uh the notebook too much
this is a nice way of in Python
obtaining the highest ranking pair so
we're basically calling the Max on this
dictionary stats and this will return
the maximum
key and then the question is how does it
rank keys so you can provide it with a
function that ranks keys and that
function is just stats.get; stats.get would basically return the value,
and so we're ranking by the value and
getting the maximum key so it's 101
comma 32, as we saw. Now to actually merge (101, 32), this is the function that I
wrote but again there are many different
versions of it so we're going to take a
list of ids and the pair that we
want to replace and that pair will be
replaced with the new index
idx so iterating through IDs if we find
the pair swap it out for idx so we
create this new list and then we start
at zero and then we go through this
entire list sequentially from left to
right and here we are checking for
equality at the current position with
the
pair um so here we are checking that the
pair matches now here is a bit of a
tricky condition that you have to append
if you're trying to be careful and that
is that um you don't want this here to
be out of Bounds at the very last
position when you're on the rightmost
element of this list otherwise this
would give you an out-of-bounds error,
so we have to make sure that we're not
at the very very last element so uh this
would be false for that so if we find a
match we append to this new list that
replacement index and we increment the
position by two so we skip over that
entire pair but otherwise if we we
haven't found a matching pair we just
sort of copy over the um element at that
position and increment by one then
return this. So here's a very small toy example: if we have a list [5, 6, 6, 7, 9, 1] and we want to replace the occurrences of (6, 7) with 99, then calling this on that will give us what we're asking for; the (6, 7) is replaced with 99.
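A sketch of that merge function (again, approximating the notebook code rather than quoting it):

```python
def merge(ids, pair, idx):
    """Replace every consecutive occurrence of `pair` in `ids` with the new token `idx`."""
    newids = []
    i = 0
    while i < len(ids):
        # only check the pair if we're not at the very last position (avoids going out of bounds)
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2        # skip over the whole pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # [5, 6, 99, 9, 1]
```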
So now I'm going to uncomment this for our actual use case, where we want to
take our tokens we want to take the top
pair here and replace it with 256 to get tokens2. If we run this, we get the
following so recall that previously we
had a length 616 in this list and now we
have a length 596 right so this
decreased by 20 which makes sense
because there are 20 occurrences
Moreover, we can try to find 256 here, and we see plenty of occurrences of it. And moreover, just to double check, there should be no occurrences of (101, 32). So this is the original array: plenty of them. And in the second array, there are no occurrences of (101, 32). So we've
successfully merged this single pair and
now we just uh iterate this so we are
going to go over the sequence again find
the most common pair and replace it so
let me now write a while loop that uses
these functions to do this um sort of
iteratively and how many times do we do
it for? Well, that's totally up to us as
a hyper parameter
the more um steps we take the larger
will be our vocabulary and the shorter
will be our sequence and there is some
sweet spot that we usually find works
the best in practice and so this is kind
of a hyperparameter and we tune it and
we find good vocabulary sizes as an
example, GPT-4 currently uses roughly 100,000 tokens, and that's around the ballpark that is reasonable currently in state-of-the-art large language models. So let me now write, putting it all together, and iterate these steps. Okay, now before we dive into the while loop,
I wanted to add one more cell here where
I went to the blog post, and instead of
grabbing just the first paragraph or two
I took the entire block post and I
stretched it out in a single line and
basically just using longer text will
allow us to have more representative
statistics for the byte pairs, and we'll just get more sensible results out of
it because it's longer text um so here
we have the raw text we encode it into
bytes using the utf8 encoding
and then here as before we are just
changing it into a list of integers in
Python just so it's easier to work with
instead of the raw byes objects and then
this is the code that I came up with uh
to actually do the merging in Loop these
two functions here are identical to what
we had above I only included them here
just so that you have the point of
reference here so uh these two are
identical and then this is the new code
that I added. So the first thing we
want to do is we want to decide on the
final vocabulary size that we want our
tokenizer to have and as I mentioned
this is a hyper parameter and you set it
in some way depending on your best
performance so let's say for us we're
going to use 276 because that way we're
going to be doing exactly 20
merges and uh 20 merges because we
already have
256 tokens for the raw bytes and to
reach 276 we have to do 20 merges uh to
add 20 new
tokens here uh this is uh one way in
Python to just create a copy of a list
so I'm taking the tokens list and by
wrapping it in a list python will
construct a new list of all the
individual elements so this is just a
copy
operation then here I'm creating a
merges uh dictionary so this merges
dictionary is going to maintain
basically the child one child two
mapping to a new uh token and so what
we're going to be building up here is a
binary tree of merges but actually it's
not exactly a tree because a tree would
have a single root node with a bunch of
leaves for us we're starting with the
leaves on the bottom which are the
individual bytes; those are the starting
256 tokens and then we're starting to
like merge two of them at a time and so
it's not a tree it's more like a forest
um uh as we merge these elements
so for 20 merges we're going to find the
most commonly occurring pair we're going
to mint a new token integer for it (i here will start at zero, so we'll actually start at 256), we're going to print that we're merging it, we're going to replace all of the occurrences of that pair with the newly minted token, and we're going to record that this pair of integers merged into this new integer.
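Putting it together, the training loop described here looks roughly like this (assuming the get_stats and merge sketches above and a `tokens` list of raw byte values):

```python
vocab_size = 276                  # desired final vocabulary size (a hyperparameter)
num_merges = vocab_size - 256     # 20 merges on top of the 256 raw byte tokens

ids = list(tokens)                # copy so we don't destroy the original list
merges = {}                       # (int, int) -> int; the learned merge rules
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)    # most frequently occurring pair
    idx = 256 + i                       # mint a new token id
    print(f"merging {pair} into a new token {idx}")
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```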
So running this gives us the following output. So we did 20 merges, and for
example the first merge was exactly as
before: the (101, 32) tokens merging into a new token 256. Now keep in mind that the
individual uh tokens 101 and 32 can
still occur in the sequence after
merging it's only when they occur
exactly consecutively that that becomes
256
now um and in particular the other thing
to notice here is that the token 256
which is the newly minted token is also
eligible for merging so here on the
bottom, the 20th merge was a merge of 256 and 259 becoming 275. So every time we replace these tokens, they become eligible for merging in the next round of the iteration. So
that's why we're building up a small
sort of binary Forest instead of a
single individual
tree one thing we can take a look at as
well is we can take a look at the
compression ratio that we've achieved so
in particular we started off with this
tokens list um so we started off with
24,000 bytes and after merging 20 times
uh we now have only
19,000 um tokens and so therefore the
compression ratio simply just dividing
the two is roughly 1.27 so that's the
amount of compression we were able to
achieve of this text with only 20
merges um and of course the more
vocabulary elements you add uh the
greater the compression ratio here would
be finally so that's kind of like um the
training of the tokenizer if you will
Now, one point I wanted to make, and maybe this is a diagram that can help illustrate it, is that the tokenizer is
a completely separate object from the
large language model itself so
everything in this lecture we're not
really touching the llm itself uh we're
just training the tokenizer this is a
completely separate pre-processing stage
usually so the tokenizer will have its
own training set just like a large
language model has a potentially
different training set so the tokenizer
has a training set of documents on which
you're going to train the
tokenizer, and then we're performing the byte pair encoding
algorithm as we saw above to train the
vocabulary of this
tokenizer so it has its own training set
it is a pre-processing stage that you
would run a single time in the beginning
and the tokenizer is trained using the byte pair encoding algorithm. Once you have the
tokenizer once it's trained and you have
the vocabulary and you have the merges
uh we can do both encoding and decoding
so these two arrows here so the
tokenizer is a translation layer between
raw text which is as we saw the sequence
of Unicode code points it can take raw
text and turn it into a token sequence
and vice versa it can take a token
sequence and translate it back into raw
text so now that we have trained uh
tokenizer and we have these merges we
are going to turn to how we can do the
encoding and the decoding step if you
give me text here are the tokens and
vice versa if you give me tokens here's
the text once we have that we can
translate between these two Realms and
then the language model is going to be
trained as a step two afterwards and
typically in a in a sort of a
state-of-the-art application you might
take all of your training data for the
language model and you might run it
through the tokenizer and sort of
translate everything into a massive
token sequence and then you can throw
away the raw text you're just left with
the tokens themselves and those are
stored on disk and that is what the
large language model is actually reading
when it's training on them. So this is one approach that you can take: a single massive pre-processing step, or stage. So yeah, basically I think the
most important thing I want to get
across is that this is completely
separate stage it usually has its own
entire uh training set you may want to
have those training sets be different
between the tokenizer and the large
language model so for example when
you're training the tokenizer as I
mentioned we don't just care about the
performance of English text we care
about many different languages,
and we also care about code or not code
so you may want to look into different
kinds of mixtures of different kinds of
languages and different amounts of code
and things like that because the amount
of different language that you have in
your tokenizer training set will
determine how many merges of it there
will be, and therefore that determines the density that this type of data will have in the token space. And so, roughly speaking,
intuitively if you add some amount of
data like say you have a ton of Japanese
data in your uh tokenizer training set
then that means that more Japanese
tokens will get merged
and therefore Japanese will have shorter
sequences uh and that's going to be
beneficial for the large language model
which has a finite context length on
which it can work on in in the token
space uh so hopefully that makes sense
so we're now going to turn to encoding
and decoding now that we have trained a
tokenizer so we have our merges and now
how do we do encoding and decoding okay
so let's begin with decoding which is
this Arrow over here so given a token
sequence let's go through the tokenizer
to get back a python string object so
the raw text so this is the function
that we' like to implement um we're
given the list of integers and we want
to return a python string if you'd like
uh try to implement this function
yourself it's a fun exercise otherwise
I'm going to start uh pasting in my own
solution so there are many different
ways to do it um here's one way I will
create an uh kind of pre-processing
variable that I will call
vocab and vocab is a mapping or a
dictionary in Python, from the token
uh ID to the bytes object for that token
so we begin with the raw bytes for
tokens from 0 to 255 and then we go in
order of all the merges and we sort of
uh populate this vocab list by doing an
addition here so this is the basically
the bytes representation of the first
child followed by the second one and
remember these are bytes objects so this
addition here is an addition of two
bytes objects just concatenation
so that's what we get
here one tricky thing to be careful with
by the way is that I'm iterating a
dictionary in Python using a DOT items
and uh it really matters that this runs
in the order in which we inserted items
into the merges dictionary. Luckily,
starting with python 3.7 this is
guaranteed to be the case but before
python 3.7 this iteration may have been
out of order with respect to how we
inserted elements into merges and this
may not have worked, but we are using a modern Python, so we're okay. And then
here uh given the IDS the first thing
we're going to do is get the
tokens so the way I implemented this
here is I'm taking I'm iterating over
all the ids, I'm using vocab to look up
their bytes and then here this is one
way in Python to concatenate all these
bytes together to create our tokens and
then these tokens here at this point are
raw bytes, so I have to decode using UTF-8 now, back into Python strings. So
previously we called .encode on a
string object to get the bytes and now
we're doing it Opposite we're taking the
bytes and calling a decode on the bytes
object to get a string in Python, and then we can return the text.
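Here is roughly the decode function described above (as written so far, without the error handling that gets discussed next):

```python
vocab = {idx: bytes([idx]) for idx in range(256)}   # raw bytes for token ids 0..255
for (p0, p1), idx in merges.items():                # relies on insertion order (Python 3.7+)
    vocab[idx] = vocab[p0] + vocab[p1]              # concatenation of two bytes objects

def decode(ids):
    """Given a list of token ids, return a Python string."""
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode("utf-8")    # note: this can raise an error, as discussed next
    return text

print(decode([97]))   # 'a'
```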
So this is how we can do it. Now, this actually has an issue in the way I implemented it, and this could actually throw an error. So try to think and
figure out why this code could actually
result in an error if we plug in um uh
some sequence of IDs that is
unlucky so let me demonstrate the issue
when I try to decode just something like
97 I am going to get letter A here back
so nothing too crazy happening but when
I try to decode 128 as a single element
the token 128, we get an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 (which is this in hex) in position 0: invalid start byte. What
does that mean well to understand what
this means we have to go back to our
utf8 page uh that I briefly showed
earlier and this is Wikipedia utf8 and
basically there's a specific schema that UTF-8 bytes take. So in particular, if you have a multi-byte object for some of the
Unicode characters they have to have
this special sort of envelope in how the
encoding works and so what's happening
here is that 'invalid start byte': that's because
128 the binary representation of it is
one followed by all zeros so we have one
and then all zero and we see here that
that doesn't conform to the format
because one followed by all zero just
doesn't fit any of these rules so to
speak, so it's an invalid start byte. For byte 1 of a multi-byte sequence, this leading one must have a one following it and then a zero following it, and then the content of your Unicode code point goes in the x's here. So basically we don't exactly follow the UTF-8 standard, and this cannot be decoded, and
so the way to fix this um is to
use the errors= argument in the bytes.decode function of Python. By default, errors is 'strict', so we will throw an error if it's not a valid UTF-8 byte encoding,
but there are many different things that
you could put here on error handling
this is the full list of all the errors
that you can use and in particular
instead of strict let's change it to
replace and that will replace uh with
this special marker this replacement
character. So with errors='replace', we now just get that replacement character back.
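To make the point concrete:

```python
# bytes([128]).decode("utf-8")                         # raises UnicodeDecodeError: invalid start byte
print(bytes([128]).decode("utf-8", errors="replace"))  # '�', the replacement character
```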
So basically, not every single byte sequence is valid UTF-8, and if it happens that your large
language model for example predicts your
tokens in a bad manner then they might
not fall into valid utf8 and then we
won't be able to decode them so the
standard practice is to basically uh use
errors equals replace and this is what
you will also find in the openai um code
that they released as well but basically
whenever you see um this kind of a
character in your output in that case uh
something went wrong and the LLM output was not a valid sequence of tokens. Okay, and now we're going to go
the other way so we are going to
implement
this Arrow right here where we are going
to be given a string and we want to
encode it into
tokens so this is the signature of the
function that we're interested in and um
this should basically print a list of
integers of the tokens so again uh try
to maybe implement this yourself if
you'd like a fun exercise uh and pause
here otherwise I'm going to start
putting in my
solution so again there are many ways to
do this so um this is one of the ways
that I sort of came up with. So the
first thing we're going to do is we are
going
to uh take our text encode it into utf8
to get the raw bytes and then as before
we're going to call list on the bytes
object to get a list of integers of
those bytes so those are the starting
tokens those are the raw bytes of our
sequence but now of course according to
the merges dictionary above and recall
this was the
merges some of the bytes may be merged
according to this lookup in addition to
that remember that the merges was built
from top to bottom and this is sort of
the order in which we inserted stuff
into merges and so we prefer to do all
these merges in the beginning before we
do these merges later because um for
example this merge over here relies on
the 256 which got merged here so we have
to go in the order from top to bottom
sort of if we are going to be merging
anything now we expect to be doing a few
merges, so we're going to be doing 'while True', and now we want to find a pair of bytes that is consecutive, that we are
allowed to merge according to this in
order to reuse some of the functionality
that we've already written I'm going to
reuse the function uh get
stats. So recall that get_stats will basically count up how many times every single pair occurs in our sequence of tokens and return that as a dictionary, and the dictionary was a mapping from all the different byte pairs to the number of times that they
occur right um at this point we don't
actually care how many times they occur
in the sequence we only care what the
raw pairs are in that sequence and so
I'm only going to be using basically the
keys of the dictionary I only care about
the set of possible merge candidates if
that makes
sense now we want to identify the pair
that we're going to be merging at this
stage of the loop so what do we want we
want to find the pair or like the a key
inside stats that has the lowest index
in the merges uh dictionary because we
want to do all the early merges before
we work our way to the late
merges so again there are many different
ways to implement this but I'm going to
do something a little bit fancy
here so I'm going to be using the Min
over an iterator in Python when you call
Min on an iterator and stats here as a
dictionary we're going to be iterating
the keys of this dictionary in Python so
we're looking at all the pairs inside
stats um which are all the consecutive
Pairs and we're going to be taking the
consecutive pair inside tokens that has the minimum... what? min takes a key, which is a function that returns the value over which we're going to do the min, and the one we care about is taking merges and basically getting that pair's
index so basically for any pair inside
stats we are going to be looking into
merges at what index it has and we want
to get the pair with the Min number so
as an example if there's a pair 101 and
32 we definitely want to get that pair
uh we want to identify it here and
return it, and pair would become (101, 32) if
it
occurs. And the reason that I'm putting a float('inf') here as a fallback is that in the get function, when we
basically consider a pair that doesn't
occur in the merges then that pair is
not eligible to be merged right so if in
the token sequence there's some pair
that is not a merging pair it cannot be
merged then uh it doesn't actually occur
here and it doesn't have an index and uh
it cannot be merged, which we will denote as float('inf'), and the reason infinity is nice here is because we're guaranteed that it's not going to participate in the list of candidates when we do the min. So this is one way to do it. So basically, long story
short this Returns the most eligible
merging candidate pair uh that occurs in
the tokens now one thing to be careful
with here is this uh function here might
fail in the following way if there's
nothing to merge, then there's nothing in merges that is satisfied anymore, there's nothing to merge, everything just returns float('inf'),
and then the pair I think will just
become the very first element of stats
um but this pair is not actually a
mergeable pair it just becomes the first
pair inside stats arbitrarily because
all of these pairs evaluate to float('inf') for the merging criterion. So basically, it could be that this doesn't succeed, because there are no more merging
pairs so if this pair is not in merges
that was returned then this is a signal
for us that actually there was nothing
to merge no single pair can be merged
anymore in that case we will break
out um nothing else can be
merged you may come up with a different
implementation by the way this is kind
of like really trying hard in
Python um but really we're just trying
to find a pair that can be merged with
the lowest index
here now if we did find a pair that is
inside merges with the lowest index then
we can merge it
so we're going to look into the merges
dictionary for that pair to look up the
index and we're going to now merge that
into that index so we're going to do
tokens equals and we're going to
replace the original tokens we're going
to be replacing the pair pair and we're
going to be replacing it with index idx
and this returns a new list of tokens
where every occurrence of pair is
replaced with idx so we're doing a merge
and we're going to be continuing this
until eventually nothing can be merged
we'll come out here and we'll break out
and here we just return
tokens. And so that's the implementation, I think. So hopefully this runs... okay, cool. And this looks
reasonable so for example 32 is a space
in ASCII, so that's here. So this looks
like it worked great okay so let's wrap
up this section of the video at least I
wanted to point out that this is not
quite the right implementation just yet
because we are leaving out a special
case so in particular if uh we try to do
this this would give us an error and the
issue is that um if we only have a
single character or an empty string then
stats is empty and that causes an issue
inside min. So one way to fix this is: if len(tokens) is at least two, because if it's less than two, it's just a single token or no tokens, then there's nothing to merge, so we just return. That would fix that case.
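Putting the whole encode function together, including the special-case guard just described, it looks roughly like this (a sketch, assuming the merges dictionary and helpers from above):

```python
def encode(text):
    """Given a string, return a list of token ids."""
    tokens = list(text.encode("utf-8"))     # start from the raw bytes
    while len(tokens) >= 2:                 # nothing to merge for 0 or 1 tokens
        stats = get_stats(tokens)
        # the pair with the lowest merge index; pairs not in merges rank as infinity
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                           # nothing left to merge
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)
    return tokens

print(encode("hello world!"))
```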
Okay, and then second, I have a few test cases here for us as well. So first, let's note
the following if we take a string and we
try to encode it and then decode it back
you'd expect to get the same string back
right is that true for all
strings so I think uh so here it is the
case and I think in general this is
probably the case. But notice that going backwards, you're not going to have an identity, because, as I mentioned, not all token sequences are valid UTF-8 byte streams, and so therefore some of them can't even be decoded. So this only goes in one direction, but for that one direction we
can check uh here if we take the
training text which is the text that we
trained the tokenizer on, we can make
sure that when we encode and decode we
get the same thing back which is true
and here I took some validation data so
I went to I think this web page and I
grabbed some text so this is text that
the tokenizer has not seen and we can
make sure that this also works um okay
so that gives us some confidence that
this was correctly implemented
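The round-trip checks described here look something like this (valtext is just a stand-in for whatever unseen text you grab):

```python
print(decode(encode("hello world")) == "hello world")   # True

text2 = decode(encode(text))        # `text` is the training text from above
print(text2 == text)                # True

valtext = "some text the tokenizer has never seen before"   # hypothetical validation text
print(decode(encode(valtext)) == valtext)                    # True
```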
So those are the basics of the byte pair
encoding algorithm we saw how we can uh
take some training set train a tokenizer
the parameters of this tokenizer really
are just this dictionary of merges and
that basically creates the little binary
forest on top of the raw bytes. Once we have this merges table,
we can both encode and decode between
raw text and token sequences so that's
the simplest setting of the
tokenizer what we're going to do now
though is we're going to look at some of
the state-of-the-art large language models and
the kinds of tokenizers that they use
and we're going to see that this picture
complexifies very quickly so we're going
to go through the details of this complexification one at a time. So let's
kick things off by looking at the GPT series. So in particular, I have the GPT-2
paper here um and this paper is from
2019 or so so 5 years ago and let's
scroll down to input representation this
is where they talk about the tokenizer
that they're using for GPT-2. Now this is
all fairly readable so I encourage you
to pause and um read this yourself but
this is where they motivate the use of
the byte pair encoding algorithm on the byte-level representation of UTF-8
encoding so this is where they motivate
it and they talk about the vocabulary
sizes and everything now everything here
is exactly as we've covered it so far
but things start to depart around here
so what they mention is that they don't
just apply the naive algorithm as we
have done it, and in particular, here's an example. Suppose that you have common
words like dog what will happen is that
dog of course occurs very frequently in
the text and it occurs right next to all
kinds of punctuation as an example so
'dog.', 'dog!', 'dog?', etc. And naively you might
imagine that the BPE algorithm could
merge these to be single tokens and then
you end up with lots of tokens that are
just like dog with a slightly different
punctuation and so it feels like you're
clustering things that shouldn't be
clustered you're combining kind of
semantics with punctuation, and this feels suboptimal, and
indeed they also say that this is
suboptimal according to some of the
experiments so what they want to do is
they want to top down in a manual way
enforce that some types of um characters
should never be merged together um so
they want to enforce these merging rules
on top of the byte pair encoding algorithm.
so let's take a look um at their code
and see how they actually enforce this
and what kinds of mergy they actually do
perform so I have to to tab open here
for gpt2 under open AI on GitHub and
when we go to
src, there is an encoder.py. Now, I don't personally love that they call it encoder.py, because this is the tokenizer, and the tokenizer can do both encode and decode, so it feels kind of awkward to me that it's called "encoder" — but that is the tokenizer, and there's a
lot going on here and we're going to
step through it in detail at one point
for now I just want to focus on this
part here: they create a regex pattern that looks very complicated, and
we're going to go through it in a bit uh
but this is the core part that allows
them to enforce rules uh for what parts
of the text Will Never Be merged for
sure. Now, notice that re.compile here is a little bit misleading, because we're not just doing `import re`, which is the Python re module — we're doing `import regex as re`, and regex is a Python package that you can install with `pip install regex`; it's basically an extension of re, so it's a bit more powerful.
so let's take a look at this pattern and
what it's doing and why this is actually
doing the separation that they are
looking for okay so I've copy pasted the
pattern here to our jupit notebook where
we left off and let's take this pattern
for a spin so in the exact same way that
their code does we're going to call an
re. findall for this pattern on any
arbitrary string that we are interested
so this is the string that we want to
encode into tokens um to feed into n llm
like gpt2 so what exactly is this doing
well re. findall will take this pattern
and try to match it against a
string um the way this works is that you
are going from left to right in the
string and you're trying to match the
pattern and R.F find all will get all
the occurrences and organize them into a
list now when you look at the um when
you look at this pattern first of all
notice that this is a raw string um and
then these are three double quotes just
to start the string so really the string
itself this is the pattern itself
right and notice that it's made up of a
lot of ores so see these vertical bars
those are ores in reg X and so you go
from left to right in this pattern and
try to match it against the string
wherever you are so we have hello and
we're going to try to match it well it's
not apostrophe s it's not apostrophe t
or any of these, but it is an optional space followed by \p{L}, one or more times. What is \p{L}? According to some documentation that I found — there might be other sources as well — \p{L} is a letter, any kind of letter from any language, and hello is
made up of letters h e l Etc so optional
space followed by a bunch of letters one
or more letters is going to match hello
but then the match ends because a white
space is not a letter so from there on
begins a new sort of attempt to match
against the string again and starting in
here we're going to skip over all of
these again until we get to the exact
same Point again and we see that there's
an optional space this is the optional
space followed by a bunch of letters one
or more of them, and so that matches. So when we run this, we get a list of two elements: 'hello' and then ' world'. Similarly, ' how', ' are', ' you' — if we add more letters, we would just get them appended like this.
Now, what is this doing and why is it important? We are taking our string, and instead of directly encoding it for tokenization, we are first splitting it
up and when you actually step through
the code and we'll do that in a bit more
detail what really is doing on a high
level is that it first splits your text
into a list of texts just like this one
and all these elements of this list are
processed independently by the tokenizer
and all of the results of that
processing are simply
concatenated so hello world oh I I
missed how hello world how are you we
have five elements of list all of these
will independent
independently go from text to a token
sequence and then that token sequence is
going to be concatenated it's all going
to be joined up and roughly speaking
what that does is you're only ever
finding merges between the elements of
this list so you can only ever consider
merges within every one of these
elements in
individually and um after you've done
all the possible merging for all of
these elements individually the results
of all that will be joined um by
concatenation and so you are basically
what what you're doing effectively is
you are never going to be merging this e
with this space because they are now
parts of the separate elements of this
list and so you are saying we are never
going to merge
eace um because we're breaking it up in
this way so basically using this regx
pattern to Chunk Up the text is just one
way of enforcing that some merges are
not to happen and we're going to go into
more of this text and we'll see that
what this is trying to do on a high
level is we're trying to not merge
across letters across numbers across
punctuation and so on so let's see in
more detail how that works so let's
continue. Now we have \p{N}: if you go to the documentation, \p{N} is any kind of numeric character in any script — so it's numbers. So we have an optional
space followed by numbers and those
would be separated out so letters and
numbers are being separated so if I do
Hello World 123 how are you then world
will stop matching here because one is
not a letter anymore but one is a number
so this group will match for that and
we'll get it as a separate entity
Let's see how these apostrophes work. So here, if we have an apostrophe-v ('ve) as an example, then the apostrophe is not a letter or a number, so 'hello' will stop matching, and then we will exactly match this alternative, so it comes out as a separate
thing so why are they doing the
apostrophes here honestly I think that
these are just like very common
apostrophes p uh that are used um
typically I don't love that they've done
this
because uh let me show you what happens
when you have uh some Unicode
apostrophes like for example you can
have if you have house then this will be
separated out because of this matching
but if you use the Unicode apostrophe
like
this then suddenly this does not work
and so this apostrophe will actually
become its own thing now and so so um
it's basically hardcoded for this
specific kind of apostrophe and uh
otherwise they become completely
separate tokens in addition to this you
can go to the GPT-2 code, and here, where they define the pattern, there's a comment: they "should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions". What they're pointing out is: you see how this is an apostrophe followed by lowercase letters? Because they didn't use re.IGNORECASE, these rules will not separate out the apostrophes if the letters are uppercase. So
house would be like this but if I did
house if I'm uppercase then notice
suddenly the apostrophe comes by
itself so the tokenization will work
differently in uppercase and lower case
inconsistently separating out these
apostrophes so it feels extremely gnarly
and slightly gross um but that's that's
how that works okay so let's come back
after trying to match a bunch of
apostrophe Expressions by the way the
other issue here is that these are quite
language specific probably so I don't
know that all the languages for example
use or don't use apostrophes but that
would be inconsistently tokenized as a
result then we try to match letters then
we try to match numbers and then if that
doesn't work we fall back to here and
what this is saying is again optional
space followed by something that is not
a letter number or a space in one or
more of that so what this is doing
effectively is this is trying to match
punctuation roughly speaking not letters
and not numbers so this group will try
to trigger for that so if I do something
like this then these parts here are not
letters or numbers but they will
actually they are uh they will actually
get caught here and so they become its
own group so we've separated out the
punctuation and finally this um this is
also a little bit confusing so this is
matching white space but this is using a
negative look ahead assertion in regex
so what this is doing is it's matching
wh space up to but not including the
last Whit space
character why is this important um this
is pretty subtle I think so you see how
the white space is always included at
the beginning of the word so um space r
space u Etc suppose we have a lot of
spaces
here what's going to happen here is that
these spaces up to not including the
last character will get caught by this
and what that will do is it will
separate out the spaces up to but not
including the last character so that the
last character can come here and join
with the um space you and the reason
that's nice is because space you is the
common token so if I didn't have these
Extra Spaces here you would just have
' you', and if I add more spaces, we still have a ' you', but now we have all this extra whitespace. So basically the GPT-2 tokenizer really likes to have a space before letters or numbers — it prepends these spaces to the word tokens — and this is just something it is consistent about. So that's what that is
for and then finally we have all the the
last fallback is um whites space
characters uh so um that would be
just um if that doesn't get caught then
this thing will catch any trailing
spaces and so on I wanted to show one
more real world example here so if we
have this string which is a piece of
python code and then we try to split it
up then this is the kind of output we
get so you'll notice that the list has
many elements here and that's because we
are splitting up fairly often — every time the category changes — so there will never be any merges within these elements, and that's what you are seeing here.
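And the same pattern applied to a small, made-up piece of Python code (not the exact string from the video), to preview the point about whitespace below:

```python
import regex as re

gpt2pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

# Illustrative snippet (assumption: any short code string shows the same effect):
code = "if x == 1:\n        print('hi')"
print(re.findall(gpt2pat, code))
# the newline plus most of the indentation comes out as one whitespace element,
# ' print' as another, then "('", 'hi', "')" as separate elements
```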
Now, you might think that in order to train the tokenizer, OpenAI has used this to
split up text into chunks and then run
just a BP algorithm within all the
chunks but that is not exactly what
happened and the reason is the following
notice that we have the spaces here uh
those Spaces end up being entire
elements, but these spaces never actually end up being merged by OpenAI, and the way you can tell is that if you copy-paste the exact same chunk into the Tiktokenizer web app, you see that all the spaces are kept independent and they're all token 220. So I think OpenAI at some point enforced some rule that these spaces
would never be merged, and so there are some additional rules on top of just chunking and BPE that OpenAI is not clear about. Now, the training code for
the gpt2 tokenizer was never released so
all we have is uh the code that I've
already shown you but this code here
that they've released is only the
inference code for the tokenizer. This is not the training code — you can't give it a piece of text and train a tokenizer; this is just the inference code, which takes the merges that we have up above and applies them to a new piece of text. And so we don't know exactly how OpenAI trained the tokenizer, but it wasn't as simple as "chunk it up and BPE it", whatever it was.
Next, I wanted to introduce you to the tiktoken library from OpenAI, which is the official library for tokenization from OpenAI. So this is tiktoken: `pip install tiktoken`, and then you can do the tokenization at inference time — this is again not training code, this is only inference code for
tokenization. I wanted to show you how you would use it — it's quite simple — and running this just gives us the GPT-2 tokens or the GPT-4 tokens. So this is the tokenizer used for GPT-4, and in particular we see that the whitespace in GPT-2 remains unmerged, but in GPT-4 these whitespaces get merged, as we also saw in this example, where here they're all unmerged but if we go down to GPT-4 they become merged.
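A minimal usage sketch of what's being run here:

```python
import tiktoken

# GPT-2: runs of spaces stay as separate tokens
enc_gpt2 = tiktoken.get_encoding("gpt2")
print(enc_gpt2.encode("    hello world!!!"))

# GPT-4 base encoding: runs of spaces get merged
enc_gpt4 = tiktoken.get_encoding("cl100k_base")
print(enc_gpt4.encode("    hello world!!!"))
```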
Now, in the GPT-4 tokenizer they changed the regular expression that they use to chunk up text. The way to see this is that if you come to the tiktoken library and go to the file tiktoken_ext/openai_public.py — this is where
sort of like the definition of all these
different tokenizers that openi
maintains is and so uh necessarily to do
the inference they had to publish some
of the details about the strings
so this is the string that we already
saw for gpt2 it is slightly different
but it is actually equivalent uh to what
we discussed here so this pattern that
we discussed is equivalent to this
pattern this one just executes a little
bit faster so here you see a little bit
of a slightly different definition but
otherwise it's the same we're going to
go into special tokens in a bit, and then if you scroll down to cl100k_base — this is the GPT-4 tokenizer — you see that the pattern has changed, and that is kind of the major change, in addition to a bunch of other special tokens, which I'll go into in a bit. Now, I'm not going to go into the full detail of the pattern change because, honestly, this is mind-numbing. I would just advise that you
pull out chat GPT and the regex
documentation and just step through it
but really, the major changes are: number one, you see this 'i' here — that means case-insensitive matching — so the comment we saw earlier ("should have added re.IGNORECASE") is addressed: we're now going to be matching these 's, 'd, 'm, etc. both in lowercase and in uppercase, so that's fixed. There's a
bunch of different like handling of the
whites space that I'm not going to go
into the full details of and then one
more thing here: you will notice that when they match numbers, they only match one to three digits, so they will never merge number sequences longer than three digits — only up to three digits will ever be merged — and that's one change they made as well, to prevent tokens that are very, very long number
sequences. But again, we don't really know why they do any of this, because none of it is documented — we just get the pattern. So yeah, it is what it is, but those are some of the changes that GPT-4 made, and of course the vocabulary size went from roughly 50K to roughly 100K.
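To see just the digit rule in isolation (this is only the number-matching piece, not the full cl100k pattern):

```python
import regex as re

# \p{N}{1,3} matches at most three digits at a time, so long numbers get chunked:
print(re.findall(r"\p{N}{1,3}", "1234567890"))
# -> ['123', '456', '789', '0']
```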
The next thing I would like to do, very briefly, is to take you through the gpt2 encoder.py that OpenAI has released. This is the file that I
already mentioned to you briefly now
this file is uh fairly short and should
be relatively understandable to you at
this point um starting at the bottom
here they are loading two files encoder
Json and vocab bpe and they do some
light processing on it and then they
call this encoder object which is the
tokenizer now if you'd like to inspect
these two files which together
constitute their saved tokenizer then
you can do that with a piece of code
like
this um this is where you can download
these two files and you can inspect them
if you'd like and what you will find is
that this encoder as they call it in
their code is exactly equivalent to our
vocab. So remember here where we have this vocab object, which allowed us to decode very efficiently — it basically took us from the integer to the bytes for that integer. Our vocab is exactly their "encoder". And then their vocab.bpe, confusingly, is actually our merges: their BPE merges, based on the data inside vocab.bpe, end up being equivalent to our merges. So basically they are saving and loading the two variables that for us are also critical: the merges variable and the vocab
variable. Using just these two variables you can represent a tokenizer, and you can both encode and decode once you've trained it.
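A sketch of how you might fetch and inspect those two files; the URLs below are the ones the GPT-2 repo's download script points at, and should be treated as an assumption rather than a guarantee:

```python
import json, os, requests

base = "https://openaipublic.blob.core.windows.net/gpt-2/models/124M"  # assumed hosting location
for fname in ["encoder.json", "vocab.bpe"]:
    if not os.path.exists(fname):
        with open(fname, "wb") as f:
            f.write(requests.get(f"{base}/{fname}").content)

with open("encoder.json") as f:
    encoder = json.load(f)                 # equivalent to our `vocab`
with open("vocab.bpe", encoding="utf-8") as f:
    bpe_data = f.read()
# first line is a version header, last line is empty, the rest are merge pairs
bpe_merges = [tuple(line.split()) for line in bpe_data.split("\n")[1:-1]]  # equivalent to our `merges`
print(len(encoder), len(bpe_merges))       # 50257 and 50000
```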
Now, the only thing that is actually slightly confusing inside what OpenAI does here is that in addition
to this encoder and a decoder they also
have something called a byte encoder and a byte decoder, and this is unfortunately just kind of a spurious implementation detail — it isn't actually deep or interesting in any way, so I'm going to skip the discussion of it. But what OpenAI
does here for reasons that I don't fully
understand is that not only have they
this tokenizer which can encode and
decode but they have a whole separate
layer here in addition that is used
serially with the tokenizer: you first byte-encode and then encode, and then you decode and then byte-decode — that's the loop — and they are just stacked serially on top of each other. It's not that interesting, so I won't cover it; you can step through it if you'd like. Otherwise, this file, if you ignore the byte encoder and the byte decoder, will be algorithmically very familiar to you, and the meat of it is what they call the bpe function. You should recognize this loop here, which is very similar to our own while loop,
where they're trying to identify the bigram — the pair — that they should be
merging next and then here just like we
had they have a for Loop trying to merge
this pair uh so they will go over all of
the sequence and they will merge the
pair whenever they find it and they keep
repeating that until they run out of
possible merges in the in the text so
that's the meat of this file and uh
there's an encode and a decode function
just like we have implemented. So, long story short, what I want you to take away at this point is that, unfortunately, it's a little bit of a messy codebase, but algorithmically it is identical to what we've built up above — and what we've built up above, if you understand it, is algorithmically what is necessary to actually build a BPE tokenizer, train it, and then both encode and decode.
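As a reminder, here is a compact sketch of the encode-time merge loop, mirroring the while loop from this lecture's own code (not OpenAI's exact implementation); `merges` maps a pair of token ids to the id created for that pair during training:

```python
def encode_tokens(ids, merges):
    # repeatedly merge the pair that was learned earliest (lowest new id)
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id); i += 2
            else:
                out.append(ids[i]); i += 1
        ids = out
    return ids
```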
the next topic I would like to turn to
is that of special tokens so in addition
to tokens that are coming from you know
raw bytes and the BP merges we can
insert all kinds of tokens that we are
going to use to delimit different parts
of the data or introduced to create a
special structure of the token streams
So if you look at this encoder object from OpenAI's GPT-2 right here — we mentioned this is very similar to our vocab — you'll notice that its length is 50257. And as I mentioned, the mapping is inverted relative to our vocab: our vocab goes from integer to string, and they go the other way around, for no particular reason. But the thing
to note here is that this mapping table has 50257 entries. Where does that number come from? What are the tokens? As I mentioned, there are 256 raw byte tokens, and then OpenAI actually did 50,000 merges, so those become the other tokens — but that would only have been 50256. So what is the 50257th token? Well,
there is basically one special
token and that one special token you can
see is called end of text so this is a
special token and it's the very last
token, and this token is used to delimit documents in the training set. So
when we're creating the training data we
have all these documents and we tokenize
them and we get a stream of tokens those
tokens only range from Z to
50256 and then in between those
documents we put special end of text
token and we insert that token in
between documents and we are using this
as a signal to the language model that
the document has ended and what follows
is going to be unrelated to the document
previously that said the language model
has to learn this from data it it needs
to learn that this token usually means
that it should wipe its sort of memory
of what came before and what came before
this token is not actually informative
to what comes next but we are expecting
the language model to just like learn
this but we're giving it the Special
sort of the limiter of these documents
We can go to Tiktokenizer — the GPT-2 tokenizer, our code that we've been playing with before — and type in "hello world how are you", and we're getting different tokens. But now you can see what happens if I type <|endoftext|>: until I finish typing it, these are all different tokens — "<|endoftext" is still separate tokens — and when I finish it, suddenly we get token 50256. And the reason this works is
because this didn't actually go through
the BPE merges; instead, the code that actually outputs the tokens has special-case instructions for handling special tokens. We did not see these special-case instructions in encoder.py — they're absent there — but if you go to the tiktoken library, which is implemented in Rust, you will find all kinds of special-case handling
for these special tokens that you can register and add to the vocabulary; it then looks for them, and whenever it sees one of these special tokens it will come in and swap in that special token. So these things are outside of the typical algorithm of byte pair encoding.
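For example, with tiktoken (`allowed_special` is the explicit opt-in described here):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# By default, special tokens appearing in user text raise an error (a safety measure):
# enc.encode("<|endoftext|>")  # ValueError unless explicitly allowed

# Explicitly allowing it swaps in the single special token id 50256:
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]
```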
These special tokens are used pervasively, not just in the base language modeling of predicting the
next token in the sequence but
especially when we get later to the fine-tuning stage and all of the ChatGPT aspects of it, because we don't just want to delimit documents — we want to delimit entire conversations between an assistant and a user. So if I refresh this Tiktokenizer page, the default example that they have here uses not the base-model encodings but the fine-tuned model tokenizers. For example, using the GPT-3.5-turbo scheme, these here are all special tokens: <|im_start|>, <|im_end|>, etc. — "im" is short for "imaginary monologue", by the way — and you can see here that there's a sort of start and end of every single message, and there can be many other tokens in use to delimit these conversations and kind of keep track of the flow of the messages
here. Now we can go back to the tiktoken library, and when you scroll to the bottom they talk about how you can extend tiktoken: you can basically fork the cl100k_base tokenizer used in GPT-4 and, for example, extend it by adding more special tokens — and these are totally up to you, you can come up with any
arbitrary tokens and add them with new IDs afterwards, and the tiktoken library will correctly swap them out when it sees them in the strings.
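The tiktoken README sketches this fork with something like the following; the underscore-prefixed attributes are internal details and may change, and the new IDs just need to not collide with existing ones:

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

enc = tiktoken.Encoding(
    name="cl100k_im",                               # your own name for the fork
    pat_str=cl100k_base._pat_str,                   # reuse the GPT-4 split pattern
    mergeable_ranks=cl100k_base._mergeable_ranks,   # reuse the learned merges
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)
print(enc.encode("<|im_start|>", allowed_special="all"))
```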
Now we can also go back to this file, which we've looked at previously. I mentioned that for GPT-2, in tiktoken_ext/openai_public.py, we have the vocabulary, we have the
pattern for splitting and then here we
are registering the single special token
in GPT-2, which was the end-of-text token,
and we saw that it has this ID
In GPT-4, when they define this, you see that the pattern has changed, as we've discussed, but also the special tokens have changed in this tokenizer. We of course have the end-of-text token just like in GPT-2, but we also see four additional tokens here: FIM prefix, middle, and suffix. What is FIM? FIM is short for "fill in the middle", and if you'd like to learn more about this idea, it comes from this paper — I'm not going to go into detail; it's beyond the scope of this video. And then there's one additional token here, so that's that encoding as well. So it's
very common basically to train a
language model and then if you'd like uh
you can add special tokens now when you
add special tokens you of course have to
um do some model surgery to the
Transformer and all the parameters
involved in that Transformer because you
are basically adding an integer and you
want to make sure that for example your
embedding Matrix for the vocabulary
tokens has to be extended by adding a
row and typically this row would be
initialized uh with small random numbers
or something like that because we need
to have a vector that now stands for
that token in addition to that you have
to go to the final layer of the
Transformer and you have to make sure
that that projection at the very end
into the classifier uh is extended by
one as well so basically there's some
model surgery involved that you have to
couple with the tokenization changes if
you are going to add special tokens but
this is a very common operation that
people do, especially if they'd like to fine-tune the model, for example taking it from a base model to a chat model like ChatGPT.
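A minimal PyTorch sketch of that surgery, assuming hypothetical module names `tok_emb` and `lm_head` (these names and the bias-free head are assumptions, not the layout of any particular codebase):

```python
import torch
import torch.nn as nn

def extend_vocab(tok_emb: nn.Embedding, lm_head: nn.Linear, n_new: int):
    old_vocab, n_embd = tok_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + n_new, n_embd)
    new_head = nn.Linear(n_embd, old_vocab + n_new, bias=False)
    with torch.no_grad():
        # copy over the trained rows; initialize the new rows with small random numbers
        new_emb.weight[:old_vocab] = tok_emb.weight
        new_emb.weight[old_vocab:].normal_(mean=0.0, std=0.02)
        new_head.weight[:old_vocab] = lm_head.weight
        new_head.weight[old_vocab:].normal_(mean=0.0, std=0.02)
    return new_emb, new_head
```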
Okay, so at this point you should have everything you need in order to build your own GPT-4 tokenizer. Now, in the
process of developing this lecture I've
done that and I published the code under
this repository
minbpe. So minbpe looks like this right now as I'm recording, but the minbpe repository will probably change quite a bit, because I intend to continue working on it. In addition to the minbpe repository, I've published this exercise
progression that you can follow so if
you go to exercise. MD here uh this is
sort of me breaking up the task ahead of
you into four steps that sort of uh
build up to what can be a gp4 tokenizer
and so feel free to follow these steps
exactly and follow a little bit of the
guidance that I've laid out here and
anytime you feel stuck, just reference the minbpe repository: either the tests could be useful, or the minbpe code itself. I try to keep the code
fairly clean and understandable and so
um feel free to reference it whenever um
you get
stuck. In addition to that, once you write it, you should be able to reproduce this behavior from tiktoken: you take the GPT-4 tokenizer, you can encode a string, and you should get these tokens; then you can encode and decode the exact same string to recover it. And in addition to all that, you should be able to implement your own train function, which the tiktoken library does not provide — it's again only inference code — but you could write your own training; minbpe does it as well, and that will allow you to train your own token vocabularies. So here, some of the output from minbpe shows the
token vocabularies that you might obtain
so on the left uh here we have the GPT 4
merges uh so the first 256 are raw
individual bytes and then here I am
visualizing the merges that gp4
performed during its training so the
very first merge that gp4 did was merge
two spaces into a single token for you
know two spaces and that is a token 256
and so this is the order in which things
merged during GPT-4 training, and this is the merge order that we obtain in minbpe by training a tokenizer — in this case I trained it on the Wikipedia page of Taylor Swift, not because I'm a Swiftie, but because that is apparently one of the longest Wikipedia pages available. But she is pretty cool. And
um what was I going to say yeah so you
can compare these two uh vocabularies
and so, as an example, here GPT-4 merged 'i' and 'n' to become 'in', and we've done the exact same thing; here on token 259, space and 't' become ' t', and that happened for us a little bit later as well. So the difference is, to my understanding, only a difference of the training set. As an example, because I see a lot of whitespace, I suspect that GPT-4 probably had a lot of Python code in its training set for the tokenizer — I'm not sure — and here we see much less
of that of course in the Wikipedia page
so roughly speaking they look the same
and they look the same because they're
running the same algorithm and when you
train your own you're probably going to
get something similar depending on what
you train it on okay so we are now going
to move on from tick token and the way
that open AI tokenizes its strings and
we're going to discuss one more very commonly used library for working with tokenization in LLMs, and that is sentencepiece. Sentencepiece is very commonly used in language models because, unlike tiktoken, it can do both training and inference, and it is quite efficient at both. It supports a number of algorithms for training vocabularies, but one of them is the byte pair encoding algorithm that we've been looking at, so it supports that. Sentencepiece is used by both the Llama and Mistral series, and many other models as well; it is on GitHub under google/sentencepiece.
and the big difference with sentence
piece and we're going to look at example
because this is kind of hard and subtle
to explain is that they think different
about the order of operations here. In the case of tiktoken, we first take the code points in the string, we encode them using UTF-8 into bytes, and then we merge bytes — it's fairly straightforward. Sentencepiece, on the other hand, works directly on the level of the code points themselves: it looks at whatever code points are available in your training set and then starts merging those code points, so the BPE is running on the level of code
points and if you happen to run out of
code points so there are maybe some rare
uh code points that just don't come up
too often and the Rarity is determined
by this character coverage hyper
parameter, then these code points will either get mapped to a special unknown token like unk, or, if you have the byte_fallback option turned on, it will take those rare code points, encode them with UTF-8, and the individual bytes of that encoding will be translated into tokens — there are special byte tokens that basically get added to the vocabulary. So it runs BPE on the code points and then falls back to bytes for rare code points. And that's kind of the difference; personally, I find the tiktoken approach significantly cleaner, but it's kind
of like a subtle but pretty major
difference between the way they approach
tokenization let's work with with a
concrete example because otherwise this
is kind of hard to um to get your head
around so let's work with a concrete
example this is how we can import
sentence piece and then here we're going
to take I think I took like the
description of sentence piece and I just
created like a little toy data set it
really likes to have a file so I created
a toy. txt file with this
content now what's kind of a little bit
crazy about sentence piece is that
there's a ton of options and
configurations and the reason this is so
is because sentence piece has been
around I think for a while and it really
tries to handle a large diversity of
things and um because it's been around I
think it has quite a bit of accumulated
historical baggage uh as well and so in
particular there's like a ton of
configuration arguments this is not even
all of it you can go to here to see all
the training
options um and uh there's also quite
useful documentation when you look at
the raw Proto buff uh that is used to
represent the trainer spec and so on um
many of these options are irrelevant to
us. So, maybe to point out one example: --shrinking_factor. This shrinking factor is not used in the byte pair encoding algorithm; it's just an argument that is irrelevant to us — it
applies to a different training
algorithm. Now, what I tried to do here is set up sentencepiece in a way that is very similar — as far as I can tell, maybe identical, hopefully — to the way that Llama 2 trained its own tokenizer. The way I did this was basically: you can take the tokenizer .model file that Meta released, open it using the protobuf schema that you can generate, inspect all the options, and then copy over all the options that looked relevant. So here we set up the input — it's raw text in this file — and here's the output: it's going to be tok400.model and tok400.vocab;
we're saying that we're going to use the BPE algorithm and we want a vocab size of 400. Then there's a ton of configurations here
for um for basically pre-processing and
normalization rules as they're called
normalization used to be very prevalent
I would say before llms in natural
language processing so in machine
translation and uh text classification
and so on you want to normalize and
simplify the text and you want to turn
it all lowercase and you want to remove
all double whites space Etc
and in language models we prefer not to
do any of it or at least that is my
preference as a deep learning person you
want to not touch your data you want to
keep the raw data as much as possible um
in a raw
form so you're basically trying to turn
off a lot of this if you can the other
thing that sentence piece does is that
it has this concept of sentences so
sentence piece it's back it's kind of
like was developed I think early in the
days where there was um an idea that
they you're training a tokenizer on a
bunch of independent sentences so it has
a lot of like how many sentences you're
going to train on what is the maximum
sentence length
um shuffling sentences and so for it
sentences are kind of like the
individual training examples but again
in the context of LLMs I find that this is a very spurious and weird distinction. Like, sentences are just —
don't touch the raw data sentences
happen to exist but in raw data sets
there are a lot of like inet like what
exactly is a sentence what isn't a
sentence um and so I think like it's
really hard to Define what an actual
sentence is if you really like dig into
it and there could be different concepts
of it in different languages or
something like that so why even
introduce the concept it it doesn't
honestly make sense to me I would just
prefer to treat a file as a giant uh
stream of
bytes it has a lot of treatment around
rare word characters and when I say word
I mean code points we're going to come
back to this in a second and it has a
lot of other rules for um basically
splitting digits splitting white space
and numbers and how you deal with that
so these are some kind of merge rules. I think this is a little bit equivalent to tiktoken using the regular expression to split up categories — there's kind of an equivalent of it, if you squint, in sentencepiece, where you can also, for example, split up the digits, and so
on there's a few more things here that
I'll come back to in a bit and then
there are some special tokens that you can indicate, and it hardcodes the unk token, the beginning of sentence, end of sentence, and a pad token — and the unk token must exist, to my understanding — and then a few other things. So we can train, and when I press train it's going to create the files tok400.model and tok400.vocab.
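A sketch of that training call with a subset of those options; the specific values here are illustrative and roughly follow what the lecture copies over from the Llama 2 tokenizer settings, not a verified reproduction:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="toy.txt",               # the toy text file created above
    model_prefix="tok400",         # writes tok400.model and tok400.vocab
    model_type="bpe",
    vocab_size=400,
    byte_fallback=True,            # rare code points fall back to byte tokens
    character_coverage=0.99995,    # illustrative value
    split_digits=True,
    add_dummy_prefix=True,
    normalization_rule_name="identity",  # don't normalize the raw text
    unk_id=0, bos_id=1, eos_id=2, pad_id=-1,
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
print([(i, sp.id_to_piece(i)) for i in range(10)])  # inspect the first few tokens
```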
I can then load the model file and inspect the vocabulary of it. So we trained with a vocab size of 400 on this text, and these are the individual pieces, the
individual tokens that sentence piece
will create. So in the beginning we see that we have the unk token with ID zero; then we have the beginning of sequence and end of sequence, one and two; and then we said the pad ID is -1, so we chose not to use it — there's no pad ID here. Then come the individual byte tokens: here we saw that byte_fallback in Llama was turned on (it's true), so what follows are the 256 byte tokens, and these are their
IDs. And then at the bottom, after the byte tokens, come the
the merges so we're not seeing the
children we're just seeing the parents
and their
ID and then after the
merges comes eventually the individual
tokens and their IDs and so these are
the individual tokens so these are the
individual code Point tokens if you will
and they come at the end so that is the
ordering with which sentence piece sort
represents its vocabularies: it starts with special tokens, then the byte tokens, then the merge tokens, and then the individual code point tokens. All these raw code point tokens are the ones
that it encountered in the training
set so those individual code points are
all the the entire set of code points
that occurred
here so those all get put in there and
then those that are extremely rare as
determined by character coverage so if a
code Point occurred only a single time
out of like a million um sentences or
something like that then it would be
ignored and it would not be added to our
vocabulary. Once we have a vocabulary, we can encode into IDs and get a list, and then here I am also decoding the individual tokens back into little "pieces", as they call them. So let's take a
look at what happened here with 'hello' followed by some Korean text. These are the token IDs we got back, and when we look at them, a few things jump to mind. Number one: the Korean characters, of course, were not part of the training set, so sentencepiece is encountering code points it has not seen during training, and those code points do not have a token associated with them. So suddenly these would be unk tokens, unknown tokens — but because byte_fallback is true, sentencepiece instead falls back to bytes: it takes the text, encodes it with UTF-8, and then uses byte tokens to represent those bytes, and that's what we are
getting here — this is the UTF-8 encoding, shifted by three because of the special tokens that have earlier IDs. So that's what happened here.
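A sketch of that encoding call, assuming the toy model trained above and that the toy training text contained no Korean:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tok400.model")  # model trained above
ids = sp.encode("hello 안녕하세요")
print(ids)
print([sp.id_to_piece(i) for i in ids])
# the unseen Korean characters show up as byte-fallback pieces like '<0xEC>', '<0x95>', ...
```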
Now, one more thing — well, first, before I go on, with respect to byte fallback: let me turn byte fallback off. If this is false, what's going to happen? Let's retrain. So the first thing that happened is that all the byte tokens disappeared, right,
and now we just have the merges — and we have a lot more merges now, because we have a lot more space, since we're not taking up room in the vocab size with all the bytes. And now if we encode this text, we get a zero: this entire string is suddenly unknown, and unknown maps to unk, so this is zero, because the unk token is token zero. You have to keep in mind that this would feed into your language model — so what is a language model supposed to do when all kinds of different things that are unrecognized, because they're rare, just end up mapping into unk? It's not exactly the property you want. So that's why I think Llama correctly used byte_fallback=true: we definitely want to feed these unknown or rare code points into the model in some manner.
the next thing I want to show you is the
following notice here when we are
decoding all the individual tokens you
see how spaces uh space here ends up
being this um bold underline I'm not
100% sure by the way why sentence piece
switches whites space into these bold
underscore characters maybe it's for
visualization I'm not 100% sure why that
happens uh but notice this why do we
have an extra space in the front of
hello um what where is this coming from
well it's coming from this option
here
um add dummy prefix is true and when you
go to the
documentation add D whites space at the
beginning of text in order to treat
World in world and hello world in the
exact same way so what this is trying to
do is the
following if we go back to our tick
tokenizer world as uh token by itself
has a different ID than space world so
we have this is 1917 but this is 14 Etc
so these are two different tokens for
the language model and the language
model has to learn from data that they
are actually kind of like a very similar
concept. So to the language model, in the tiktoken world, words at the beginning of sentences and words in the middle of sentences actually look completely different, and it has to learn that they are roughly the same. This add_dummy_prefix option is trying to fight that a little bit, and the way it works is that, as part of pre-processing, it will take the string and prepend a space — it will do this — in an effort to make this 'world' and that 'world' the same: they will both be ' world'. So that's one other kind of pre-processing option that is turned on, and Llama 2 also uses it. And that's, I think,
everything that I want to say for my
preview of sentence piece and how it is
different. Maybe here, what I've done is put in the raw protocol buffer representation of the tokenizer that Llama 2 trained, so feel free to step through this, and if you would like your tokenization to look identical to that of Meta's Llama 2, then you would be copy-pasting these settings, as I tried to do up above. And yeah, I think that's it for
this section I think my summary for
sentence piece from all of this is
number one I think that there's a lot of
historical baggage in sentence piece a
lot of Concepts that I think are
slightly confusing and I think
potentially um contain foot guns like
this concept of a sentence and it's
maximum length and stuff like that um
otherwise it is fairly commonly used in
the industry um because it is efficient
and can do both training and inference
It has a few quirks — for example, the unk token must exist, and the way the byte fallback is done, and so on, I don't find particularly elegant — and unfortunately I have to say it's not very well documented, so it took me a lot of time working with it myself, just visualizing things and trying to really understand what is happening, because the documentation, in my opinion, is not super amazing. But it is a very nice repo
that is available to you if you'd like
to train your own tokenizer right now
okay let me now switch gears again as
we're starting to slowly wrap up here I
want to revisit, in a bit more detail, the issue of how we should set the vocab size and what some of the considerations around it are. For this I'd like to go back to the model architecture we developed in the last video, when we built GPT from scratch. This was the file that we built in the previous video, where we defined the Transformer model; let's specifically look at vocab size and where it appears in this file. Here we define the vocab size — at the time it was 65 or something like that, an extremely small number — and this will grow much larger. You'll see that vocab size doesn't come up too much in most of these layers;
the only place that it comes up to is in
exactly these two places here so when we
Define the language model there's the
token embedding table which is this
two-dimensional array where the vocab size is basically the number of rows, and each vocabulary element, each token, has a vector that we're going to train using backpropagation; that vector is of size n_embd, the number of channels in the Transformer. So as vocab size increases, this embedding table, as I mentioned earlier, is also going to grow — we're going to be
adding rows in addition to that at the
end of the Transformer there's this LM
head layer which is a linear layer and
you'll notice that that layer is used at
the very end to produce the logits uh
which become the probabilities for the
next token in sequence and so
intuitively we're trying to produce a
probability for every single token that
might come next at every point in time
of that Transformer and if we have more
and more tokens we need to produce more
and more probabilities so every single
token is going to introduce an
additional dot product that we have to
do here in this linear layer for this
final layer in a
Transformer so why can't vocap size be
infinite why can't we grow to Infinity
well number one your token embedding
table is going to grow uh your linear
layer is going to grow so we're going to
be doing a lot more computation here
because this lm_head layer will become more computationally expensive. Number two: because we have more parameters, we could be worried that we are going to be under-training some of these
parameters so intuitively if you have a
very large vocabulary size say we have a
million uh tokens then every one of
these tokens is going to come up more
and more rarely in the training data
because there's a lot more other tokens
all over the place and so we're going to
be seeing fewer and fewer examples uh
for each individual token and you might
be worried that basically the vectors
associated with every token will be
undertrained as a result because they
just don't come up too often and they
don't participate in the forward
backward pass in addition to that as
your vocab size grows you're going to
start shrinking your sequences a lot
right and that's really nice because
that means that we're going to be
attending to more and more text so
that's nice, but you might also be worried that chunks that are too large are being squished into single tokens, so the model just doesn't have as much time to think per some number of characters of text — or you can think about it that way. So
basically we're squishing too much
information into a single token and then
the forward pass of the Transformer is
not enough to actually process that
information appropriately and so these
are some of the considerations you're
thinking about when you're designing the
vocab size. As I mentioned, this is mostly an empirical hyperparameter, and it seems like in state-of-the-art architectures today it is usually in the high tens of thousands, or somewhere around 100,000. And
The next consideration I want to briefly talk about is: what if we want to take a pre-trained model and extend the vocab size? This is done fairly commonly; for example, when you're doing fine-tuning for ChatGPT,
a lot more new special tokens get
introduced on top of the base model to
maintain the metadata and all the
structure of conversation objects
between a user and an assistant so that
takes a lot of special tokens you might
also try to throw in more special tokens
for example for using the browser or any
other tool and so it's very tempting to
add a lot of tokens for all kinds of
special functionality so if you want to
be adding a token that's totally
possible Right all we have to do is we
have to resize this embedding so we have
to add rows we would initialize these uh
parameters from scratch to be small
random numbers and then we have to
extend the weight inside this linear uh
so we have to start making dot products
um with the associated parameters as
well to basically calculate the
probabilities for these new tokens so
both of these are just a resizing
operation it's a very mild
model surgery and can be done fairly
easily and it's quite common that
basically you would freeze the base
model you introduce these new parameters
and then you only train these new
parameters to introduce new tokens into
the architecture um and so you can
freeze arbitrary parts of it or you can
train arbitrary parts of it and that's
totally up to you but basically minor
surgery required if you'd like to
introduce new tokens and finally I'd
like to mention that actually there's an
entire design space of applications in
terms of introducing new tokens into a
vocabulary that go Way Beyond just
adding special tokens and special new
functionality so just to give you a
sense of the design space but this could
be an entire video just by itself uh
this is a paper on learning to compress
prompts with what they called uh gist
tokens and the rough idea is suppose
that you're using language models in a
setting that requires very long prompts
while these long prompts just slow
everything down because you have to
encode them, and then you have to use them, and then you're attending over them, and it's just heavy to have very large prompts. So instead, what they
do here in this paper is they introduce
new tokens and um imagine basically
having a few new tokens you put them in
a sequence and then you train the model
by distillation so you are keeping the
entire model Frozen and you're only
training the representations of the new
tokens their embeddings and you're
optimizing over the new tokens such that
the behavior of the language model is
identical uh to the model that has a
very long prompt that works for you and
so it's a compression technique of
compressing that very long prompt into
those few new gist tokens and so you can
train this and then at test time you can
discard your old prompt and just swap in
those tokens and they sort of like uh
stand in for that very long prompt and
have an almost identical performance and
so this is one um technique and a class
of parameter efficient fine-tuning
techniques where most of the model is
basically fixed: there's no training of the model weights, there's no LoRA training or anything like that — the only parameters you're training are the new token embeddings. So that's just one
example but this could again be like an
entire video but just to give you a
sense that there's a whole design space
here that is potentially worth exploring
in the future the next thing I want to
briefly address is that I think recently
there's a lot of momentum in how you
actually could construct Transformers
that can simultaneously process not just
text as the input modality but a lot of
other modalities so be it images videos
audio Etc and how do you feed in all
these modalities and potentially predict
these modalities from a Transformer uh
do you have to change the architecture
in some fundamental way and I think what
a lot of people are starting to converge
towards is that you're not changing the
architecture you stick with the
Transformer you just kind of tokenize
your input domains and then call the day
and pretend it's just text tokens and
just do everything else identical in an
identical manner so here for example
there was an early paper that has a nice graphic for how you can take an image and chunk it into integers, and these basically become the tokens of images, as an example. These tokens can be hard tokens, where you force them to be integers, or they can be soft tokens, where you don't require them to be discrete but you do force the representations to go through bottlenecks, like in autoencoders. Also, in this paper that came out from OpenAI — Sora — which I think
really um uh blew the mind of many
people and inspired a lot of people in
terms of what's possible they have a
Graphic here and they talk briefly about
how LLMs have text tokens while Sora has visual patches — so again, they came up with a way to chunk videos into basically tokens with their own vocabularies — and then you can either process discrete tokens, say with autoregressive models, or even soft tokens with diffusion models, and all of that
is sort of uh being actively worked on
designed on and is beyond the scope of
this video but just something I wanted
to mention briefly okay now that we have
come quite deep into the tokenization
algorithm and we understand a lot more
about how it works let's loop back
around to the beginning of this video
and go through some of these bullet
points and really see why they happen so
first of all why can't my llm spell
words very well or do other spell
related
tasks so fundamentally this is because
as we saw these characters are chunked
up into tokens and some of these tokens
are actually fairly long so as an
example, I went to the GPT-4 vocabulary and looked at one of the longer tokens: '.DefaultCellStyle' turns out to be a single individual token. That's a lot of characters for one token, so my suspicion is that there's just too much crammed into this single token, and that the model should not be very good at tasks related to the spelling of this single token. So I asked: how many letters 'l' are there in the word '.DefaultCellStyle'? Of course, my prompt is intentionally done that way, and you can see how '.DefaultCellStyle' will be a single token — this is what the model sees. My suspicion was that it wouldn't be very good at this, and indeed it is not: it doesn't actually know how many l's are in there; it thinks there are three, and actually there are four, if I'm not getting this wrong myself. So that didn't go extremely well. Let's look at another kind of character-level task: for example, here I asked GPT-4 to reverse the string '.DefaultCellStyle', and
they tried to use a code interpreter and
I stopped it and I said just do it just
try it, and it gave me a jumble — so it doesn't actually know how to reverse this string going from right to left; it gave a wrong result. So,
again like working with this working
hypothesis that maybe this is due to the
tokenization I tried a different
approach I said okay let's reverse the
exact same string but take the following
approach step one just print out every
single character separated by spaces and
then as a step two reverse that list and
it again tried to use a tool, but when I stopped it, it first produced all the characters, and that was actually correct, and then it reversed them, and that was also correct once it had them. So
somehow it can't reverse it directly but
when you go just first uh you know
listing it out in order it can do that
somehow and then it can once it's uh
broken up this way this becomes all
these individual characters and so now
this is much easier for it to see these
individual tokens and reverse them and
print them out so that is kind of
interesting. So let's continue: why are LLMs worse at non-English languages? I briefly covered this already, but basically it's not only that the language model sees less non-English data during training of the model parameters — the tokenizer is also not sufficiently trained on non-English data. So here, for example, 'hello how are you' is five tokens and its translation is 15 tokens, a three-times blow-up. For example, annyeong (안녕) is basically just 'hello' in Korean, and that ends up being three tokens — I'm actually kind of surprised by that, because it's a very common phrase, just the typical greeting of hello — and that ends up being
three tokens whereas our hello is a
single token. So basically everything is a lot more bloated and diffuse, and this is, I think, partly the reason the model works worse on other languages.
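A quick way to see the blow-up for yourself with tiktoken (the exact counts depend on the tokenizer, so they are printed rather than hard-coded here):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["Hello, how are you?", "안녕하세요, 어떻게 지내세요?"]:
    print(len(enc.encode(s)), s)  # the non-English string uses noticeably more tokens
```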
Coming back: why is the LLM bad at simple arithmetic? That has to do with the tokenization of numbers, and so
you'll notice that, for example, addition is very character-level: there's an algorithm for doing addition that works digit by digit — you first add the ones, then the tens, then the hundreds — and you have to refer to specific parts of these digits. But these numbers are represented completely arbitrarily, based on whatever happened to merge or not merge during the tokenization process. There's an entire blog post about this that I think is quite good, "Integer tokenization is insane", where the author systematically explores the tokenization of numbers in (I believe) GPT-2. They notice that, for example, for four-digit numbers, you can look at whether a number is a single token or two tokens — a 1-3, a 2-2, or a 3-1 split — and different numbers fall into all the different combinations. You can imagine this is completely arbitrary, and the model unfortunately sometimes sees a single token for all four digits, sometimes a token for three, sometimes for two, sometimes for one, in an arbitrary manner. So this is
definitely a headwind if you will for
the language model and it's kind of
incredible that it can kind of do it and
deal with it but it's also kind of not
ideal. And so that's why, for example, we saw that when Meta trained Llama 2 using sentencepiece, they make sure to split up all the digits, and this is partly to improve simple arithmetic performance.
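A quick way to see the arbitrary chunking of numbers for yourself (the splits are printed rather than hard-coded, since they depend on the tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for n in ["1", "12", "123", "1234", "2024", "9999"]:
    ids = enc.encode(n)
    print(n, ids, [enc.decode([i]) for i in ids])  # some numbers are one token, others split unevenly
```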
Finally, why is GPT-2 not as good at Python? Again, this is partly a modeling issue — the architecture, the data set, and the strength of the
model but it's also partially
tokenization because as we saw here with
the simple python example the encoding
efficiency of the tokenizer for handling
spaces in Python is terrible and every
single space is an individual token and
this dramatically reduces the context
length that the model can attend to
across. So that's almost like a tokenization bug for GPT-2, and it was later fixed in GPT-4. Okay, so here's
another fun one my llm abruptly halts
when it sees the string end of text so
Here's a very strange behavior: "print the string <|endoftext|>" is what I told GPT-4, and it says "could you please specify the string?" — and I'm telling it, give me <|endoftext|>, and it seems like there's an issue: it's not seeing <|endoftext|>. Then I tell it that <|endoftext|> is the string, and here's the string, and then it just doesn't print it. So
obviously something is breaking here
with respect to the handling of the
special token. I don't actually know what OpenAI is doing under the hood here, and whether they are potentially parsing this as an actual special token instead of just treating 'end of text' as individual pieces of text without the special-token handling logic. It might be that someone, when they're calling .encode, is passing in allowed_special and allowing 'end of text' as a special token in the user
prompt. But the user prompt, of course, is attacker-controlled text, so you would hope that they don't parse or use special tokens from that kind of input, but it appears that there's something definitely going
wrong here, and so your knowledge of these special tokens potentially ends up being an attack surface. If you'd like to confuse LLMs, just try giving them some special tokens and see if you're breaking something by chance.
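For reference, here is a small sketch (my own) of the allowed_special / disallowed_special behavior in tiktoken that I was alluding to above; whether OpenAI's serving stack does anything like this is my speculation, not something I know.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "please print <|endoftext|>"

# Default: special-token strings inside user text are disallowed and raise an error.
try:
    enc.encode(text)
except ValueError as err:
    print("refused:", err)

# Treat it as ordinary text: the string is chopped into normal tokens.
print(enc.encode(text, disallowed_special=()))

# Explicitly allow it: the substring collapses into the single special token id.
print(enc.encode(text, allowed_special={"<|endoftext|>"}))
```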
Okay, so this next one is a really fun one: the trailing whitespace issue. So if
you come to the playground, and we come here to gpt-3.5-turbo-instruct, so this
is not a chat model this is a completion
model, so think of it more like it's a lot closer to a base model; it does
completion it will continue the token
sequence so here's a tagline for ice
cream shop and we want to continue the
sequence and so we can submit and get a
bunch of tokens okay no problem but now
suppose I do this, but instead of pressing submit I type 'here's a tagline for ice cream shop ' with a trailing space, so I have a space there before I click
submit, we get a warning: 'your text ends in a trailing space, which causes worse performance due to how the API splits text into tokens'. So what's happening here? It
still gave us a uh sort of completion
here but let's take a look at what's
happening so here's a tagline for an ice
cream shop, and then what does this look like in the actual training data? Suppose you found the completion in the
training document somewhere on the
internet and the llm trained on this
data so maybe it's something like oh
yeah maybe that's the tagline that's a
terrible tagline but notice here that
when I look at the 'o', you see that the space character is always a prefix of these tokens in GPT: it's not an 'o' token, it's a ' o' token; the space is part of the 'o', and together they are token 8840, that's ' o'. So what's happening here is that when I just have
it like this and let it complete the next token, it can sample the ' o' token. But instead, if I have this and I add my space, then
when I encode this string, I have basically 'here's a tagline for an ice cream shop' and this space at the very end becomes token 220. So we've added token 220, and this token otherwise would be part of the tagline, because if there actually is a tagline here, ' o' is the token. And so this is suddenly out of distribution for the model, because this
space is part of the next token but
we're putting it here like this and the
model has seen very very little data of
actual Space by itself and we're asking
it to complete the sequence like add in
more tokens but the problem is that
we've sort of begun the first token and
now it's been split up and now we're out
of this distribution and now arbitrary
bad things happen. It's just a very rare example for the model to see something like that, and that's why we get the warning.
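Here is a small sketch (mine) with the GPT-2 encoding showing the mechanics: the space normally rides along as a prefix of the next word's token, so a prompt that ends with a bare space adds a lone, rarely-seen space token.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

prompt = "Here's a tagline for an ice cream shop:"
print(enc.encode(prompt))        # ends on a clean token boundary
print(enc.encode(prompt + " "))  # same ids plus a lone trailing-space token (220)

print(enc.encode(" oh"))  # space + word typically fuse into a single token
print(enc.encode(" "))    # the bare space on its own: token 220 in GPT-2's vocab
```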
So the fundamental issue here is, of course, that the LLM operates on top of these tokens, and these tokens are text chunks; they're not characters in the way you and I would think of them. These are the atoms of what the LLM is
seeing and there's a bunch of weird
stuff that comes out of it. Let's go back to our '.DefaultCellStyle' example. I bet you that the model has never in its training set seen '.DefaultCellSty' without the 'le' at the end; it's always seen this as a single group, because this is some kind of function, I'm guessing part of some API, I don't actually know exactly what. But I bet it's never seen this combination of tokens in its training data, or it would be extremely rare.
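To see what such a partial token looks like, here is a small sketch (my own) comparing the token splits of the full string and the chopped-off version; the exact splits depend on the encoding, but the shattering effect is the point.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

full = ".DefaultCellStyle"
partial = ".DefaultCellSty"  # same string with the trailing "le" removed

for s in (full, partial):
    ids = enc.encode(s)
    print(repr(s), "->", [enc.decode([i]) for i in ids])
# The full identifier maps to a familiar token (or a short, common sequence),
# while the chopped version breaks into pieces the model has rarely, if ever,
# seen together during training.
```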
So I took this and copy-pasted it into the playground and tried to complete from it, and it immediately gave me a big error. It said the model predicted a
completion that begins with a stop
sequence resulting in no output consider
adjusting your prompt or stop sequences
So what happened here when I clicked submit is that the model immediately emitted something like an end-of-text token; it basically predicted the stop sequence right away, so it had no completion, and
so this is why I'm getting a warning
again because we're off the data
distribution and the model is just predicting totally arbitrary things. It's just really confused; basically this is giving it brain damage.
it's never seen this before it's shocked
and it's predicting end of text or
or something. I tried it again, and in this case it completed it, but then for some reason the request was flagged as potentially violating the usage policies. Basically something just goes wrong; there's a jankiness you can feel, because the model is extremely unhappy with this input and doesn't know how to complete it: it has never occurred in its training set. In the training set it always appears like this and becomes a single token.
So these kinds of issues, where you either complete the first character of the next token, or you have long tokens where only some of the characters are present, are all what I'd describe as issues with partial tokens. And if you actually dig into the tiktoken repository, go to the Rust code and search for 'unstable', you'll see encode_unstable_native, unstable tokens, and a lot of special-case handling.
none of this stuff about unstable tokens
is documented anywhere but there's a ton
of code dealing with unstable tokens and
unstable tokens is exactly kind of like
what I'm describing here. What you would like out of a completion API is something a lot more fancy: if we're putting in '.DefaultCellSty' and asking for a completion, we're not actually trying to append the next token exactly after this token list; we're really trying to search over characters that, if we re-tokenized everything, would be of high probability, if that makes sense, so that we can add a single individual character instead of just appending the next full token that comes after this partial token list. This is
very tricky to describe and I invite you
to maybe like look through this it ends
up being extremely gnarly and hairy kind
of topic, and it comes from tokenization fundamentally. So maybe I can even spend an entire video talking about unstable tokens sometime in the future. Okay, and I'm really saving the best for last: my favorite one by far is SolidGoldMagikarp. So this comes from the SolidGoldMagikarp blog post, and this is internet
famous now for those of us in LLMs. Basically, I would advise you to read this blog post in full. What this person did is go to the token embedding table and cluster the tokens based on their embedding representation, and they noticed that there's a cluster of tokens that look really strange: a cluster with 'at rot', 'e stream Fame', 'SolidGoldMagikarp', 'Signet message', really weird tokens, basically, in this embedding cluster. So what are these tokens and where do they even come from? What is SolidGoldMagikarp? It makes no sense. And then they found a bunch of these
tokens. And then they noticed that actually the plot thickens here, because if you ask the model about these tokens, like you ask it some very benign question like 'please, can you repeat back to me the string SolidGoldMagikarp', then you get a variety of basically totally broken LLM behavior. Either you get evasion ('I'm sorry, I can't hear you'), or you get a bunch of hallucinations as a response. You can even get back insults: you ask it about StreamerBot and the model actually just calls you names, or it comes up with weird humor. You're actually breaking the model by asking about these very simple strings like 'at Roth' and 'SolidGoldMagikarp'. So what the hell is happening? And there's a variety of
is happening and there's a variety of
here documented behaviors uh there's a
bunch of tokens not just so good
Magikarp that have that kind of a
behavior and so basically there's a
bunch of like trigger words and if you
ask the model about these trigger words
or you just include them in your prompt
the model goes haywire and has all kinds
of really strange behaviors, including ones that violate typical safety guidelines and the alignment of the model, like swearing back at you. So
what is happening here and how can this
possibly be true well this again comes
down to tokenization so what's happening
here is that SolidGoldMagikarp, if you actually dig into it, is a Reddit user: there's a u/SolidGoldMagikarp. And probably what happened here,
even though I don't know that this has been really definitively explored, what is thought to have happened is that the tokenization dataset was very different from the training dataset for the actual language model. In the tokenization dataset there was potentially a ton of Reddit data where the user SolidGoldMagikarp was mentioned in the text; because SolidGoldMagikarp posted a lot, this string occurs many times in the tokenization dataset, and so these characters end up getting merged into a single dedicated token for that single Reddit user, SolidGoldMagikarp. So they would have a dedicated token in the vocabulary of, was it 50,000 tokens in GPT-2, that is devoted to that Reddit user.
and then what happens is the
tokenization data set has those strings
but then later when you train the model
the language model itself, this data from Reddit was not present, and so in the entire training set for the language model, SolidGoldMagikarp never occurs. That token never appears in
the training set for the actual language
model later so this token never gets
activated it's initialized at random in
the beginning of optimization then you
have forward backward passes and updates
to the model and this token is just
never updated in the embedding table
that row Vector never gets sampled it
never gets used so it never gets trained
and it's completely untrained it's kind
of like unallocated memory in a typical
binary program written in C or something like that. So it's unallocated
memory and then at test time if you
evoke this token then you're basically
plucking out a row of the embedding
table that is completely untrained and
that feeds into a Transformer and
creates undefined behavior and that's
what we're seeing here: completely undefined, never-before-seen-in-training behavior. And so any of these weird tokens would evoke this behavior, because fundamentally the model is out of sample, out of distribution.
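Here is a tiny PyTorch sketch (my own toy example, not GPT-2) of the mechanism: a row of the embedding table whose token id never appears in the training data simply never receives a gradient, so at test time it is still the random noise it was initialized with.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 8, 4
emb = nn.Embedding(vocab_size, d_model)          # every row starts as random noise
opt = torch.optim.SGD(emb.parameters(), lr=0.1)  # plain SGD, no weight decay

unused_id = 7                                     # stand-in for "SolidGoldMagikarp"
before = emb.weight[unused_id].detach().clone()

for _ in range(100):
    ids = torch.randint(0, unused_id, (32,))      # only tokens 0..6 ever occur
    loss = emb(ids).pow(2).mean()                 # stand-in for a real LM loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# The unused row never got a gradient, so it is still its random initialization.
print(torch.allclose(before, emb.weight[unused_id]))  # True
```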
Okay, and the very last thing I wanted to briefly point out, although I think a lot of people are quite aware of this, is that
different kinds of formats and different
representations and different languages
and so on might be more or less
efficient with GPT tokenizers, or any tokenizer for any other LLM for that matter. So for example, JSON is actually really dense in tokens and YAML is a lot more efficient: these are the same data in JSON and in YAML, and the JSON is 116 tokens while the YAML is 99, so quite a bit of
an improvement. In the token economy, where we are paying per token in many ways, you are paying in context length and you are paying in dollars for the cost of processing this kind of structured data, so prefer YAML over JSON. In general, tokenization density is something you have to care about at all times: try to find efficient encoding schemes, spend a lot of time in tiktokenizer, and measure the token efficiencies of different formats and settings.
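Here is a small sketch (mine) of how you might measure this for your own data; it assumes PyYAML is installed, and the exact counts depend on the data and the encoding.

```python
import json
import tiktoken
import yaml  # PyYAML, assumed installed

enc = tiktoken.get_encoding("cl100k_base")
data = {"product": "ice cream", "flavors": ["vanilla", "chocolate"], "price": 4.5}

as_json = json.dumps(data, indent=2)
as_yaml = yaml.dump(data, sort_keys=False)

print(len(enc.encode(as_json)), "tokens as JSON")
print(len(enc.encode(as_yaml)), "tokens as YAML")
```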
Okay, so that concludes my fairly long video on tokenization. I know it's dry, I know it's annoying, I know it's irritating, and I personally really dislike this stage. What
I do have to say at this point is don't
brush it off: there are a lot of footguns and sharp edges here, security issues, AI safety issues, as we saw with plugging unallocated memory into language models. So it's worth understanding this stage. That said, I will say that
eternal glory goes to anyone who can get
rid of it. I showed you one possible paper that tried to do that, and I hope a lot more can follow over time. My final recommendations for applications right now are: if you can reuse the GPT-4 tokens and vocabulary in your application, then that's something you should consider, and just use tiktoken, because it is a very efficient and nice library for BPE inference. I also really like the byte-level BPE that tiktoken and OpenAI use.
If you for some reason want to train your own vocabulary from scratch, then I would use BPE with SentencePiece. As I mentioned, I'm not a huge fan of SentencePiece: I don't like its byte fallback, and I don't like that it does BPE on Unicode code points. It also has like a million settings, there are a lot of footguns here, and it's really easy to miscalibrate them so you end up cropping your sentences or something like that, because of some parameter that you don't fully understand. So be very careful with the settings, try to copy exactly what Meta did, or basically spend a lot of time looking at all the hyperparameters and go through the code of SentencePiece to make sure you have this correct. But even if you
have all the settings correct I still
think that the algorithm is kind of
inferior to what's happening here and
maybe the best thing, if you really need to train your own vocabulary, is to just wait for minbpe to become as efficient as possible; that's something I hope to work on at some point. Really what we want is tiktoken but with training code, and that is the ideal thing that currently does not exist; minbpe is an implementation of it, but currently it's in Python. So that's what I have to say for tokenization. There
might be an advanced video that is even drier and even more detailed in the future, but for now I think we're going to leave things off here, and I hope that was helpful. Bye.