LLM & Transformers Explained in 5 Minutes | How ChatGPT and Modern AI Actually Work

By Wreetojyoti Ray

Summary

## Key takeaways

- **Core Job: Predict Next Word**: At its very core, all a large language model is trying to do, its one and only job, is to predict the next word in a sentence. But when you do that one simple thing at a ridiculous scale with billions and billions of data points, something truly incredible starts to happen. [00:43], [00:52]
- **Tokenization Chops Subwords**: Tokenization breaks our sentences down into these little pieces called tokens and gives every single token its own unique number. The solution is subword tokens: by breaking words down into their common parts, the AI can actually see the relationship between run, running, and ran. [01:35], [02:04]
- **Self-Attention Fixes Memory**: The big breakthrough, the solution to this memory problem, is this brilliant idea called self-attention. It's like a magical highlighter. As it's reading a sentence, it can highlight which other words, even words from way earlier in the paragraph, are most important. [02:47], [02:58]
- **Transformers from 2017 Paper**: The concept of self-attention was so powerful that it became the beating heart of a whole new system laid out in a 2017 research paper called "Attention Is All You Need." That introduced the transformer architecture, the foundation for pretty much every major AI we use today. [03:17], [03:35]
- **RLHF Aligns with Human Values**: This whole four-step dance has a name. It's called reinforcement learning from human feedback, or RLHF for short. It's an incredibly powerful technique that lets us steer the AI's behavior and align it with what we actually want. [05:56], [06:03]
- **Emergent Capabilities from Scale**: One of the most wild things about these models is that as they get bigger, they start to develop what are called emergent capabilities, skills they were never ever explicitly taught. Things like being able to do basic math or write computer code just appear. [07:14], [07:25]

Topics Covered

LLMs Solely Predict Next Word
Subword Tokenization Reveals Word Relationships
Self-Attention Solves Context Memory
RLHF Aligns AI with Human Values
Scale Unlocks Emergent Capabilities

Full Transcript

All right. Today, we're going to do something a little ambitious. We're going to build a digital brain completely from scratch. And by the end of this, I promise you'll have a much better picture of how these large language models, the AIs that are changing everything, actually, well, think.

So, have you ever used an AI? You ask it a question and the answer it gives you is just stunning. It feels like magic, right? But here's the thing, it's not. Today, we're going to pop the hood on that digital mind and actually build one ourselves, step by step, so you can see exactly how the whole thing really works.

So, where do we even start? Well, it all kicks off with a goal that is frankly surprisingly simple. Just forget for a moment about writing beautiful poetry or explaining quantum mechanics. At its very core, all a large language model is trying to do, its one and only job, is to predict the next word in a sentence. That's it.
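To make the "predict the next word" job concrete, here is a minimal sketch of a toy predictor that simply counts which word tends to follow which. This is an assumption-heavy illustration, not how a real LLM works internally: real models use neural networks over tokens rather than raw counts, and the function names and the tiny corpus below are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_counts(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following

def predict_next_word(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical toy corpus; a real model trains on vastly more text.
corpus = "the sky is blue . the sea is blue . the grass is green ."
model = train_bigram_counts(corpus)
print(predict_next_word(model, "is"))   # blue
print(predict_next_word(model, "sky"))  # is
```

Even this toy version shows the basic shape of the task: given what came before, pick the most likely continuation.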

Seriously. But when you do that one simple thing at a ridiculous scale with billions and billions of data points, something truly incredible starts to happen.

Okay, so part one, every building project needs raw materials, right? For our digital brain, that material is basically all of human language. But we have a big problem. Computers don't speak English or Spanish or Japanese. They speak math. So our very first job is to translate our words into the only language they understand, numbers.

This whole translation process has a name. It's called tokenization. The easiest way to think about it is like you're cooking a meal. You can't just toss a whole carrot in the pot. You have to chop it up into smaller, more manageable pieces first. Tokenization does the exact same thing for language.
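As a rough sketch of that chopping, the snippet below splits sentences on whitespace and gives each distinct piece a number. The splitting rule and the helper names `build_vocab` and `encode` are assumptions for illustration only; production tokenizers are considerably more sophisticated.

```python
def build_vocab(sentences):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Translate a sentence into the list of IDs of its tokens."""
    return [vocab[token] for token in sentence.lower().split()]

vocab = build_vocab(["the cat sat", "the dog sat down"])
print(vocab)                         # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
print(encode("the dog sat", vocab))  # [0, 3, 2]
```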

It breaks our sentences down into these little pieces called tokens. And then it gives every single token its own unique number.

Now, what's really, really clever is how modern AIs do this chopping. The old way was to just break text into whole words, but that was super clunky. A word like running and the word ran were seen as completely different things, totally unrelated, and if it saw a new word, something like webinard, the whole system would just freak out. So, the solution: subword tokens. By breaking words down into their common parts, the AI can actually see the relationship between run, running, and ran. It can even piece together new words or figure out typos, which makes it way more flexible.
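One way to picture subword tokens is a greedy longest-match splitter over a small vocabulary of word pieces. The sketch below is only an illustration under stated assumptions: the vocabulary is hand-picked for the example, whereas real tokenizers (byte-pair encoding, WordPiece and similar) learn their pieces from data and handle many more edge cases.

```python
def subword_tokenize(word, vocab):
    """Split `word` into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical hand-made vocabulary of common word parts.
toy_vocab = {"run", "ning", "ner", "s", "web", "inar"}
print(subword_tokenize("running", toy_vocab))   # ['run', 'ning']
print(subword_tokenize("runners", toy_vocab))   # ['run', 'ner', 's']
print(subword_tokenize("webinars", toy_vocab))  # ['web', 'inar', 's']
```

Because "running" and "runners" both start with the shared piece "run", the model gets a visible link between them, and a word it has never seen can still be assembled from pieces it knows.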

So we've got our raw materials. Our language has been chopped up and turned into a stream of numbers. Now, we need the engine. This is part two, the core architecture. This is where we actually design the thinking machine itself.

And this brings us to a massive problem that just plagued early AI for years. I want you to imagine reading a really long, complicated paragraph, but by the time you get to the last sentence, you've completely forgotten how it started.

That's exactly what it was like for older models. They had just awful short-term memory, which made understanding context pretty much impossible. The big breakthrough, the solution to this memory problem, is this brilliant idea called self-attention. You can think of it like giving our AI a superpower. It's like a magical highlighter. As it's reading a sentence, it can highlight which other words, even words from way earlier in the paragraph, are most important for understanding the specific word it's looking at right now.
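The highlighter idea can be sketched in a few lines of plain Python. This is a simplified version of scaled dot-product self-attention: it assumes each word is already represented as a small vector (the numbers below are invented), and it leaves out the learned query, key, and value projections and the multiple heads that a real transformer uses.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score every word against the word currently being read.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)  # the "highlighter" strengths
        # Blend all word vectors according to those strengths.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["the", "cat", "sat"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.9]]  # made-up 2-D word vectors
outputs, weights = self_attention(toy_vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how strongly one word highlights every word in the sentence, itself included. With these toy vectors, "sat" puts more weight on "cat" than on "the".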

This is what allows it to draw connections and finally, finally get a real sense of context. And this just shows you how huge that idea was. The concept of self-attention was so powerful, so revolutionary, that it became the beating heart of a whole new system. It was all laid out in a 2017 research paper with the perfect title, "Attention Is All You Need." That paper introduced what was called the transformer architecture, and it completely changed the game. It is the foundation for pretty much every major AI we use today.

So, let's take stock. We've built the brain's architecture. We have the engine, but right now it's just an empty vessel. It doesn't know anything. So, now it's time for part three, the education. It's time to send our digital brain to school.

Now, this education happens on a scale that is just hard to wrap your head around. A model like GPT-3 has 175 billion parameters. You can think of each of these parameters as a tiny little knob that controls the strength of the connection between two ideas. And as the AI reads and reads, it is constantly tuning all 175 billion of these knobs, learning that the word sky is very strongly connected to blue and that queen is connected to king. And it's doing all of this just to get better at its one core job, predicting the next word.

This whole education process really happens in two main phases. First up, there's pre-training. This is basically like giving the AI a library card to the entire internet and just letting it read, well, everything. It just soaks up grammar, facts, common sense, and even how we reason about things. After all that, then comes fine-tuning. And this is more like specialized job training. We give it very specific, high-quality examples to teach it how to be a helpful assistant that can actually follow our instructions.

Okay, so this brings us to a really crucial point. After all that pre-training, our model knows a ton about the world, but it has no idea what we humans actually want. It doesn't know what we think is helpful or polite or safe. It's kind of like a brilliant student who's read every book in the library, but has zero social skills.

This final step is all about teaching the AI our values. So, how do we teach it something as fuzzy as our preferences? Well, with this really clever feedback loop. Here's how it works. Step one, we give the AI a prompt and have it generate a few different answers. Step two, and this is the key part, a human looks at those answers and just ranks them from best to worst. We don't have to write the perfect answer. We just have to show it what we like better. Then in step three, all that ranking data is used to train a second, separate AI, a reward model, whose only job is to get really good at predicting how a human would rank any given answer.
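Step three can be sketched with a tiny reward model trained on pairwise preferences. Everything here is an assumption for illustration: the two answer features, the linear scoring function, and the training settings are invented, and a real reward model is itself a large neural network that reads the full text of each answer. The update rule is the standard pairwise logistic (Bradley-Terry style) objective, which pushes the preferred answer's score above the rejected one's.

```python
import math

def reward(weights, features):
    """Linear reward: a weighted sum of an answer's feature values."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(preference_pairs, num_features, steps=200, lr=0.5):
    """Fit weights so that preferred answers score higher than rejected ones."""
    weights = [0.0] * num_features
    for _ in range(steps):
        for preferred, rejected in preference_pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Probability the model currently assigns to the human's choice.
            p_correct = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log(p_correct) for the pairwise logistic loss.
            for i in range(num_features):
                weights[i] += lr * (1.0 - p_correct) * (preferred[i] - rejected[i])
    return weights

# Hypothetical features per answer: [answers_the_question, is_polite]
pairs = [
    ([1.0, 1.0], [1.0, 0.0]),  # polite answer ranked above a rude one
    ([1.0, 0.0], [0.0, 1.0]),  # on-topic answer ranked above a polite non-answer
    ([1.0, 1.0], [0.0, 0.0]),  # helpful and polite ranked above neither
]
weights = train_reward_model(pairs, num_features=2)
print(reward(weights, [1.0, 1.0]) > reward(weights, [0.0, 1.0]))  # True
```

Note that the human rankings never say what the perfect answer is. They only say which of two answers is better, and the reward model turns those comparisons into a score it can assign to new answers.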

And finally, in step four, we use that AI judge to coach our main model, updating it to generate the kinds of answers that would get a high score from the judge. This whole four-step dance has a name. It's called reinforcement learning from human feedback, or RLHF for short. It's an incredibly powerful technique that lets us steer the AI's behavior and align it with what we actually want. This is the secret sauce that turns a raw, super knowledgeable text predictor into a reliable and helpful AI assistant that we can actually work with.

All right, let's step back for a second and look at what we've built. We started with literally nothing, just an idea, and now we've assembled a complete digital brain. So, let's put it all together. Remember where we started? With that one almost comically simple goal: just predict the next word. But by taking that simple idea, scaling it up with a powerful architecture, and feeding it an education the size of the internet, we ended up with a system that can show real complex reasoning, creativity, and a genuine-seeming understanding of the world.

So, let's recap the build. We started with our raw materials, using tokenization to turn human language into computer-friendly numbers. Then, we built our core engine, the transformer, all powered by that game-changing idea of self-attention. Next, we gave it a massive education through pre-training.

And finally, we gave it some coaching using human feedback to make it helpful and align it with our own values. And that leaves us with one final and honestly kind of mind-bending question.

See, one of the most wild things about these models is that as they get bigger, they start to develop what are called emergent capabilities, skills they were never ever explicitly taught. Things like being able to do basic math or write computer code just appear. So, as we keep scaling these digital brains up, the real question isn't just what we're going to teach them next, but what they might start to learn all on their own.
