
Wk01 - Stanford CS146S - Introduction to Coding LLMs and AI Development

By AI With Ryan

Summary

Topics Covered

  • Pre-training Compresses Internet into Probabilistic Zip
  • ChatGPT Simulates Ideal Human Labeler
  • Models Need Tokens to Think Step-by-Step
  • AI Augments Engineers with 10x Speed
  • LLMs May Invent Superior Human Logic

Full Transcript

We've all interacted with them. Large language models. I mean, whether it's asking ChatGPT for a recipe or having an AI summarize a huge document, >> right?

>> You hit send and almost instantly you get something back that feels intelligent. But what's actually happening in that black box after you hit send?

>> And maybe more importantly, what happened for months or even years before you ever typed that first prompt?

>> Exactly.

>> That's the core question we're diving into today. This deep dive is really about pulling back the curtain on this incredible multi-stage engineering pipeline that transforms raw, messy internet data into the AI assistant you use.

>> So we're talking about the fundamental stages of training.

>> Yep. We're tracking the three big ones.

Pre-training, supervised fine-tuning, and reinforcement learning. And then

we'll get into the advanced prompting techniques you actually need to command these things. Our sources today come from some pretty deep technical papers and engineering lectures that really reveal the stunning scale and the clever tricks behind all of this.

>> So our mission for you, the learner, is to really get the true architecture of these systems. >> We're going to trace the model's evolution from the trillions of pieces of text it consumes all the way up to the methods that make it seem like it can reason.

>> So let's start at the very beginning.

Before an LLM is a helpful assistant, it starts as what we call a base model. And

what is that functionally?

>> Functionally, it is a massive highly sophisticated text completion machine.

The goal of this first stage, pre-training, is just knowledge acquisition on a scale that is genuinely hard to wrap your head around.

>> The first step being data gathering, which I know from the sources is just biblical in scope.

>> It really is. It begins with this intense data scraping, mostly from the deep archives of the internet. The biggest source is often Common Crawl.

That's the nonprofit that's been indexing the web for years >> since 2007. Yeah. Over 2.7 billion pages. And then that raw internet data is combined with, you know, high-quality curated documents: Wikipedia, academic journals, public code from GitHub.

>> You said the data is raw, which I mean that must be incredibly messy. I know

they don't just dump the entire internet into the model.

>> Oh, they absolutely can't. The filtering process is ruthless. They remove everything: spam, malware, known racist content, adult material, even just repetitive text.

>> But even after all that culling, the data sets are still astronomical >> completely. A representative, high-quality public data set like FineWeb still clocks in at about 44 terabytes of text data.

>> Wait, 44 terabytes of just text? That's

dizzying.

>> It is. And that text needs to be processed. It gets converted into sequences of tokens >> which are basically small chunks of text. A word or even part of a word.

>> Exactly. And that FineWeb data set alone translates to approximately 15 trillion tokens. 15,000 billion pieces of training data that the model has to absorb.
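To make the token idea concrete, here is a toy greedy tokenizer. The tiny vocabulary is invented purely for illustration; production systems learn byte-pair-encoding vocabularies of tens of thousands of entries, but the chunking idea is the same.

```python
# Hypothetical mini-vocabulary; real tokenizers learn theirs from data.
VOCAB = ["pre", "train", "ing", "the", "cat", "sat", "on", " "]

def tokenize(text):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i,
        # falling back to a single character for unknown text.
        match = max(
            [v for v in VOCAB if text.startswith(v, i)],
            key=len,
            default=text[i],
        )
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("pretraining"))  # → ['pre', 'train', 'ing']
```

Note how "pretraining" becomes three tokens: a word can split into several pieces, which is why token counts run higher than word counts.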

>> So we have the trillions of tokens. Now what's the core mechanical operation here? What are those billions of dollars in GPUs actually doing?

>> Well, the mechanism is conceptually, surprisingly simple. LLMs are autoregressive models. All that means is that they are predicting the probability of the next token in a sequence based entirely on the text that came before it. If I say "the cat sat on the" >> the model predicts "mat" or "rug" or something like that.

>> Precisely. It calculates the most probable completion. Yeah.
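A sketch of that next-token step, with a hand-made probability table standing in for the billions of learned parameters a real model would use to compute the distribution:

```python
import random

# Hypothetical next-token distribution for one context; a real model
# produces such a distribution for any context via its neural network.
NEXT_TOKEN_PROBS = {
    "the cat sat on the": {"mat": 0.6, "rug": 0.3, "roof": 0.1},
}

def complete(context, seed=0):
    """Autoregressive step: sample the next token given everything before it."""
    probs = NEXT_TOKEN_PROBS[context]
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    nxt = random.Random(seed).choices(tokens, weights=weights)[0]
    return context + " " + nxt
```

In a real model this step repeats: each sampled token is appended to the context and fed back in, until an end-of-sequence token appears.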

>> And modern models can handle a stunning amount of context to make that prediction.

>> Right. The context window, it's working memory has gotten huge.

>> It's grown exponentially. An older model like GPT-2 could only handle maybe a thousand tokens of context. Modern ones can look back at hundreds of thousands, sometimes even a million tokens.

>> And this prediction is all mathematical.

It runs that context through a neural network with billions or even trillions of adjustable parameters.

>> Think of them like knobs on a giant mixing board. 1.6 billion for GPT-2, maybe a reported 1.8 trillion for GPT-4.

>> And training is just the process of nudging those knobs over and over. So

the model's prediction of the next token matches what was actually in the training data >> millions of times per second. And this

optimization requires an almost unbelievable amount of specialized computational power.
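The "nudging" is gradient descent. Here is a one-knob toy version; the squared-error loss and the numbers are illustrative stand-ins for the next-token loss optimized over billions of parameters:

```python
def train_step(knob, target, lr=0.1):
    """One nudge: move the knob downhill on the loss (knob - target)^2."""
    gradient = 2 * (knob - target)  # derivative of the loss w.r.t. the knob
    return knob - lr * gradient

# Repeated nudges drive the knob toward the value the training data demands.
knob = 0.0
for _ in range(50):
    knob = train_step(knob, target=1.0)
print(knob)  # converges toward 1.0
```

Real training does this millions of times per second across every knob at once, which is what all that GPU hardware is for.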

>> This is the gold rush you mentioned.

>> It is. Companies are rushing to rent these massive machines, like an 8x H100 GPU node, which can cost about $3 per GPU per hour. The sheer expense is why training a big foundational model costs tens of millions of dollars. That cost analysis really dictates the whole structure of the industry, doesn't it?

Only a few players can even afford to create these base models.

>> It does because every single one of those expensive machines is just trying to predict the next token faster and more accurately.

>> So the goal of pre-training is to create a massive lossy compression of the internet.

>> That's a powerful way to think about it.

A probabilistic zip file of human knowledge.

>> But if it's compressed, what information gets lost?

>> Well, it loses nuance. It loses the ability to reason or to understand what is truthful versus what is just common on the internet. It learns the syntax of knowledge but not really the deep structure of logic

>> which leads us perfectly into the next stage. That base model, that expensive autocomplete, is not what we use when we talk to ChatGPT.

>> Exactly. The output of pre-training is just the base model. To turn it into a helpful conversational assistant, we have to enter the post-training stages.

And this starts with supervised fine-tuning or SFT.

>> SFT is the first step in shaping the model's personality. And it's computationally way cheaper. We're talking hours instead of months.

>> And you swap out the data, >> right? We swap the raw internet text for a smaller, highly curated data set: hundreds of thousands of examples of high-quality dialogues.

>> This is where human labelers come in.

They're basically defining the AI's persona.

>> Absolutely. Humans are hired, given these really strict, detailed instructions, and they're tasked with writing the ideal assistant response for all kinds of prompts. They're literally

teaching the LLM to adopt that helpful, truthful, and harmless persona by example.

>> So, when I talk to an AI assistant, I'm not talking to some new form of intelligence. I'm interacting with a neural network that is simulating the average, highly skilled human labeler who followed those instructions perfectly.

>> That's the simplest way to put it. Yeah. You're talking to the most expensive, highly optimized simulation of human customer service ever built.

>> Okay. So, once the model has the right persona, we need to teach it how to actually solve problems. >> And that moves us to the third and most advanced stage, reinforcement learning, or RL.

>> How is this different from SFT?

>> SFT taught it what to say. RL teaches it how to think. With RL, the model can practice problem solving in verifiable domains like math and code where we know

for a fact what the correct final answer is.

>> So you give it the problem and the answer, but it has to figure out how to get there.

>> It has to discover the sequence of tokens that reliably leads to that result. And this is where it gets truly fascinating >> because this is where we see the model's cognitive strategies emerge. It learns to think by creating a kind of internal monologue.

>> That's right. Because the model has finite computation per token, it learns it has to distribute its reasoning across multiple tokens. We call this chain of thought, or CoT.

>> So if you just ask it to spit out the answer to a hard math problem in one go, it fails.

>> It will consistently fail at the mental arithmetic. Yeah. You have to train it to spread out the calculation step by step.

>> That's the key takeaway for the user right there. The model needs tokens to think. If you don't let it think step by step, you're asking it to do calculus without any scratch paper. Precisely.

Now, this pure RL process works great when the answer is objectively verifiable. Math, code, chess.

>> But what about things that are subjective, like writing a joke or a persuasive essay?

>> For those domains, companies use an adaptation called reinforcement learning from human feedback or RLHF.

>> This is where they train a second AI, a reward model, to act like a human judge.

>> Correct. A human ranks different AI outputs and the reward model learns to replicate those rankings. It becomes a proxy for human taste.
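That "replicate the rankings" objective is often formalized as a pairwise preference loss. This Bradley-Terry-style sketch is one common formulation, not something the source specifies:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Low when the reward model scores the human-preferred output higher.

    Models P(chosen beats rejected) = sigmoid(score_chosen - score_rejected)
    and returns the negative log of that probability.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

# Agreeing with the human ranking gives near-zero loss;
# inverting the ranking is penalized heavily.
agree = preference_loss(5.0, 1.0)
invert = preference_loss(1.0, 5.0)
```

Training the reward model means nudging its scores so this loss shrinks across many human-ranked pairs; the LLM is then tuned to maximize the resulting score.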

>> But the sources point out a major drawback here.

>> They do. Reinforcement learning is extremely good at finding ways to game any simulation. The LLM can learn to generate content that just tricks the reward model into giving a high score, prioritizing optimization over genuine quality.

>> So RLHF is powerful, but it's a short-lived process. The model will eventually learn to exploit the reward system.

>> Exactly. Human oversight remains mandatory.

>> Okay. So once you have this highly trained model, with its knowledge, personality, and some reasoning ability, the final stage is all about prompt engineering.

>> Right. The art of command. It's

basically programming the model through conversation.

>> A foundational technique here is in-context learning. K-shot prompting.

>> Yep. You just give the model a few examples, maybe one, three, or five, right inside the prompt itself.

>> So if I need it to translate sentences into Korean using a very specific honorific style, >> you just give it one or two examples of that exact style and the model instantly adapts. It learns the pattern in context.
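Assembling a k-shot prompt is plain string work. The demonstration pairs below are invented placeholders, not vetted translations:

```python
def build_k_shot_prompt(examples, query):
    """Demonstrations first, then the new input for the model to complete."""
    lines = ["Translate into polite (honorific) Korean:"]
    for source, target in examples:
        lines.append("English: " + source)
        lines.append("Korean: " + target)
    lines.append("English: " + query)
    lines.append("Korean:")  # the model continues from here
    return "\n".join(lines)

# Hypothetical two-shot prompt:
demos = [("Hello.", "안녕하세요."), ("Thank you.", "감사합니다.")]
prompt = build_k_shot_prompt(demos, "Good night.")
```

The model never sees an explicit rule; the repeated English/Korean pattern in the prompt itself is what it imitates.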

>> But for more complex logical tasks, we have to go back to chain of thought. We have to. And even a simple instruction like "let's think step by step", which is called zero-shot CoT, dramatically improves accuracy. You're unlocking the model's ability to reason out loud.
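Zero-shot CoT is nothing more than appending that trigger phrase. A minimal sketch comparing the two prompt styles (the sample question is invented):

```python
COT_TRIGGER = "Let's think step by step."

def direct_prompt(question):
    """Baseline: demand the answer in one shot -- no room to 'think'."""
    return question + "\nAnswer:"

def zero_shot_cot_prompt(question):
    """Zero-shot CoT: the same question plus the trigger, so the model
    spends tokens on intermediate reasoning before the final answer."""
    return question + "\n" + COT_TRIGGER

q = "A train leaves at 3pm and travels 120 km at 60 km/h. When does it arrive?"
```

The only difference between the two prompts is the trailing phrase, yet it changes how many tokens of reasoning the model emits before committing to an answer.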

>> But the model's knowledge is frozen in time, right? It doesn't know what happened yesterday or what's in my company's private database.

>> And that is why external tools are so critical for real world tasks, particularly retrieval augmented generation, or RAG. RAG is our crucial defense against hallucinations. It's like giving the AI a smart Google search function.
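That retrieve-then-answer loop can be sketched in a few lines. The keyword-overlap retriever and the document list here are crude stand-ins for a real search tool or vector database:

```python
# Hypothetical private documents the model's training never saw.
DOCUMENTS = [
    "Q3 revenue was 4.2 million dollars, up 12 percent year over year.",
    "The on-call rotation changes every Monday at 9am.",
]

def retrieve(query, docs):
    """Pick the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def build_rag_prompt(query):
    """Insert the retrieved text into the context ahead of the question."""
    context = retrieve(query, DOCUMENTS)
    return ("Context: " + context + "\n\nQuestion: " + query +
            "\nAnswer using only the context.")
```

The model then answers from the retrieved context sitting in its window, not from stale training memory.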

>> That's a great way to frame it. The LLM

realizes it needs current information, so it generates special tokens that invoke a tool, like a web search, >> and the results of that search get fed back into the AI's working memory,

>> inserted directly into its context window right before it generates the final answer. It forces the model to use current facts instead of just its old training memory. We can refine this even more with self-correction techniques >> for sure. Take self-consistency. Instead of asking the model a question once, you ask it five times. You force it to find five different reasoning paths and then you take the majority answer.
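Majority voting over several sampled answers is a one-liner with Counter. The five answers here are hard-coded stand-ins for real model calls at nonzero temperature:

```python
from collections import Counter

def self_consistency(answers):
    """Return the most common final answer across independent reasoning paths."""
    return Counter(answers).most_common(1)[0][0]

# One of five sampled reasoning paths went wrong; voting washes it out.
samples = ["42", "42", "17", "42", "42"]
print(self_consistency(samples))  # → 42
```

The cost is k model calls instead of one, traded for fewer random reasoning errors.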

>> That cuts down on random errors. And

then there's reflection, which is great for coding.

>> It's fantastic. You feed an error message back into the prompt and ask the model to critique and revise its own code. You're teaching it to learn from its mistakes in real time. These techniques are what separate simple chatting from high-leverage professional use.
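A reflection loop is just that feedback cycle in code. The `call_model` parameter is a placeholder for whatever LLM client you use, and the tag names are illustrative:

```python
def reflect_and_fix(code, error_message, call_model):
    """Feed failing code plus its error back to the model for critique and revision.

    call_model: any function mapping a prompt string to a model response
    (a stand-in for a real LLM API client).
    """
    prompt = (
        "The following code failed.\n"
        "<code>\n" + code + "\n</code>\n"
        "<error>\n" + error_message + "\n</error>\n"
        "Critique the code, explain the bug, then output a corrected version."
    )
    return call_model(prompt)
```

In practice you would run the revised code, and if it fails again, loop: the new error goes back in as the next round of feedback.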

>> Absolutely. Which brings us to the core shift happening for professionals. The

discussion is always about replacement.

>> But the material highlights a different idea. The mantra is: you won't be replaced by AI. You'll be replaced by a competent engineer who knows how to use AI.

>> The LLM isn't replacing the engineer.

It's augmenting them. Turning a regular engineer into a hyper-efficient one. We see this with internal use at places like OpenAI with their Codex models.

>> Okay, let's take a concrete example: code understanding. An engineer pastes a stack trace from an error. Normally that's an hour of digging, >> right? Instead, they prompt the model: "Given this stack trace, locate the offending logic in this repo and trace the data flow." That saves 45 minutes just like that.

>> Refactoring is another huge one. Updating a legacy API call across dozens of files, that's a tedious three-hour manual job. With an LLM, you prompt it to apply this change consistently across all 40 files, and it takes 90 seconds. That's real leverage.

>> Or performance optimization.

Instead of hunting for a bottleneck for an hour, you have the model analyze the code, flag the slow database calls, and draft a fix.

>> 30 minutes of work becomes 5 minutes of prompting. And finally, QA and testing.

You can point the model at low-coverage code overnight and wake up to runnable unit tests that cover edge cases a human might have missed.

>> But to get this kind of precision, you can't just type a casual request. There

are best practices.

>> Yes. First is role prompting. You have to aggressively define the persona: "You are a helpful assistant that loves programming at the level of a senior software developer."

>> And structured prompts, using tags like <error> or <log> when you paste in complex data, so the model can parse it efficiently. And the most advanced technique is persistent context.

Engineers will keep a file like agents.md with all the business logic and naming conventions the AI can't possibly know >> and they paste that context into every session.
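Combining those two practices, wrapping pasted data in tags and prepending a persistent context file, is simple string assembly. The file name, tag names, and sample stack trace here are illustrative placeholders:

```python
from pathlib import Path

def build_debug_prompt(stack_trace, context_file="agents.md"):
    """Persistent project context first, then the pasted data wrapped in tags."""
    path = Path(context_file)
    # Business logic and naming conventions the model can't possibly know.
    context = path.read_text() if path.exists() else ""
    return (
        context + "\n"
        "You are a senior software developer.\n"
        "<error>\n" + stack_trace + "\n</error>\n"
        "Locate the cause of this error and trace the data flow."
    )
```

Because the context file is re-read on every call, updating your conventions in one place keeps every future prompt informed.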

>> Yep. It keeps the model informed and dramatically improves accuracy across all your prompts. So, we've charted this whole path today: from trillions of raw internet tokens, through human-labeled conversations that give the AI its personality, right up to the advanced reasoning that requires chain of thought.

>> The core takeaway for you, the learner, is that LLMs are powerful, but they have these cognitive deficits. We call it the Swiss cheese model of capability.

>> The holes don't line up >> exactly. It can solve an Olympiad-level math problem but then fail to compare two simple numbers like 9.11 and 9.9. You have to treat them as stochastic tools.

Always check their work and use these techniques to compensate for their limitations.

>> Okay, let's unpack this for one final thought. Something for you to really mull over after this deep dive.

>> When LLMs are trained with pure reinforcement learning on problems where the answer is verifiable, the model isn't constrained by human logic. We saw

this years ago with AlphaGo discovering move 37, this brilliant, totally unforeseen strategy, >> a move no human would have considered.

>> Right? So, if these thinking models continue to evolve in open-domain problem solving, discovering new, superior cognitive strategies, maybe even a new internal language for thought that's more efficient than English, what will it mean for us when the optimal way to solve a complex business problem is something no human could have conceived?

That raises a really important question about the future of human-AI collaboration.
