9 AI Concepts Explained in 7 minutes: AI Agents, RAGs, Tokenization, RLHF, Diffusion, LoRA...
By ByteByteAI
Summary
## Key takeaways

- **Tokenization converts text into integers**: Neural networks like LLMs cannot work with raw text directly; a tokenizer breaks text into smaller units called tokens and maps each token to an integer ID so the model can process it. [00:10], [00:18]
- **Greedy decoding lacks creativity**: Greedy decoding always picks the most likely next token and works well for deterministic tasks, but not for tasks requiring creativity, where sampling-based methods add controlled randomness. [01:08], [01:18]
- **An LLM alone cannot take actions**: An LLM on its own only generates text and cannot take actions like browsing the web, checking the weather, or running code; multi-step agents wrap an LLM in a loop with access to tools and memory to plan and execute tasks. [02:25], [02:38]
- **RAG grounds answers in external evidence**: A plain LLM answers using only what is stored in its weights, so it can be wrong or outdated on recent events; RAG pairs an LLM with a retrieval system connected to a knowledge store so it pulls relevant passages to write grounded answers. [02:57], [03:18]
- **RLHF made ChatGPT succeed**: The initial launch of ChatGPT succeeded largely because of the RLHF stage, which uses a reward model trained from human preference pairs to push the model toward outputs that people rate as more helpful, clear, and safe. [03:30], [03:47]
- **LoRA fine-tunes without full retraining**: LoRA keeps original linear layer weights frozen and adds two small low-rank trainable matrices, allowing a model to learn domain-specific adjustments with far fewer new parameters than full fine-tuning. [06:14], [06:24]
Topics Covered
- Tokenizers Use BPE Merging
- Top-P Sampling Boosts Diversity
- Chain-of-Thought Enables Multi-Step Logic
- RAG Grounds Answers in Evidence
- LoRA Adapts Giants Efficiently
Full Transcript
Most modern AI products are built from the same set of core ideas. In the next seven minutes, I'll walk through nine concepts you will see repeatedly across real-world AI systems.

One, tokenization. Neural networks like LLMs cannot work with raw text directly. A tokenizer breaks text into smaller units called tokens and maps each token to an integer ID, so the model can take the sequence of IDs as input instead of raw text. The most common algorithm is byte pair encoding, or BPE. BPE starts from small units, often bytes or characters, and repeatedly merges the most frequent adjacent pairs to form new tokens. Over time, common fragments like "ing" or "ti" become single tokens, so a word like "walking" might be split as "walk" plus "ing".
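The merge loop described above can be sketched in a few lines of Python. This is a toy trainer over a hand-made word-frequency table (the corpus, the `bpe_train` name, and the merge count are illustrative, not the video's code); real tokenizers operate on bytes over huge corpora, and which fragments emerge depends entirely on the training data:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: `words` maps a word (tuple of symbols) to its frequency."""
    vocab = Counter(words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Merge every occurrence of that pair into a single new token.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

On a tiny corpus of "walking", "talking", and "walk", a handful of merges is enough to fuse frequent fragments into single tokens.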
Two, text decoding. An LLM simply outputs a probability distribution over the vocabulary for the next token. A decoding algorithm chooses one token from that distribution, appends it to the sequence, and repeats the process to produce a full response. The simplest approach is greedy decoding, which always picks the most likely next token. It can work well for deterministic tasks, but not for tasks requiring creativity. Sampling-based methods add controlled randomness to improve diversity. For example, top-p sampling restricts the choice to the smallest set of tokens whose probabilities sum to at least p, then samples from that set.
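The nucleus-selection step can be illustrated with a short sketch (a minimal stand-alone version, not the video's code; real implementations work on sorted logits in a tensor library):

```python
import random

def top_p_sample(probs, p, rng=random):
    """Sample a token id from the smallest set of tokens whose
    probabilities sum to at least p (top-p / nucleus sampling)."""
    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Renormalize within the nucleus, then sample from it.
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With a small p the nucleus can shrink to a single token, which makes top-p behave like greedy decoding; larger p admits more tokens and more diversity.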
Three, prompt engineering. Vague prompts usually lead to vague answers. Prompt engineering is the practice of shaping instructions and context to steer a model's behavior without changing its weights. A strong prompt clearly states the task, key constraints, and the expected output format. One common technique is few-shot prompting, where you include a handful of examples so the model imitates the desired style and structure. Another is chain-of-thought (CoT) prompting, in which you ask for step-by-step reasoning. CoT prompting can improve performance on problems that require multi-step logic, like math and coding. Prompt engineering is widely used because it is fast to iterate on and inexpensive compared to training or fine-tuning a model.
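A few-shot prompt is ultimately just careful string assembly. Here is one possible layout (the `Input:`/`Output:` labels and the helper name are illustrative conventions, not a fixed format):

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: task description, worked examples,
    then the new input the model should complete."""
    lines = [task, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)
```

Ending the prompt at `Output:` invites the model to continue in exactly the style the examples established.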
Four, multi-step AI agents. An LLM on its own only generates text. It cannot take actions like browsing the web, checking the weather, or running code. Multi-step agents wrap an LLM in a loop with access to tools and memory, so it can plan what to do next, call external tools, and use the results to decide the next step. The agent repeats this cycle until it reaches the goal, runs out of budget, or determines it cannot make further progress.
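The plan-act-observe cycle can be sketched as a small loop. Everything here is a simplification: the `llm` callable stands in for a real model that emits structured actions, and `tools` is just a dict of Python functions:

```python
def run_agent(llm, tools, goal, max_steps=5):
    """Minimal agent loop: `llm` plans the next action from memory;
    tool results are appended to memory until the model finishes
    or the step budget runs out."""
    memory = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = llm(memory)           # e.g. {"tool": "add", "input": ...}
        if action.get("final"):        # the model decided it is done
            return action["final"]
        result = tools[action["tool"]](action["input"])
        memory.append(f"{action['tool']}({action['input']}) -> {result}")
    return "Stopped: step budget exhausted."
```

The budget check matters in practice: without it, an agent that never reaches its goal would loop forever.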
Five, retrieval-augmented generation. A plain LLM answers using only what is stored in its weights, so it can be wrong or outdated on recent events or changing company policies. RAG pairs an LLM with a retrieval system connected to a knowledge store. When you ask a question, the retriever first pulls relevant passages from sources like PDFs, docs, or a database. Then the LLM uses those passages to write the answer. This grounds the response in external evidence instead of relying only on the model's memory.
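The retrieve-then-generate flow can be sketched as below. The word-overlap scorer is a deliberately crude stand-in for the embedding similarity a real retriever would use, and the prompt wording is illustrative:

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (a toy proxy
    for vector similarity search) and return the top k."""
    q = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def rag_answer(llm, query, documents):
    """Pull relevant passages, then ask the model to answer from them."""
    passages = retrieve(query, documents)
    prompt = ("Answer using only the context below.\n\n"
              + "\n".join(f"- {p}" for p in passages)
              + f"\n\nQuestion: {query}\nAnswer:")
    return llm(prompt)
```

The key design point is that the model sees the retrieved passages inside its prompt, so its answer can cite evidence it never memorized during training.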
Six, reinforcement learning from human feedback. The initial launch of ChatGPT succeeded in large part because of the RLHF stage. RLHF is a reinforcement learning approach where the model practices by generating multiple candidate responses. A separate reward model scores them, and the training algorithm updates the model's weights so higher-scoring responses become more likely over time. This pushes the model toward outputs that people consistently rate as more helpful, clear, and safe, not just outputs that are statistically likely. RLHF aligns an LLM with human preferences mainly because of how the reward model is trained. The reward model learns directly from human feedback, usually from pairs of model responses to the same prompt where annotators pick the one they prefer. By learning these preference patterns, the reward model becomes a proxy for what humans tend to want, and reinforcement learning uses that signal to steer the LLM toward responses that score higher on that proxy.
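The preference pairs feed a pairwise training objective. A common choice (one of several, shown here as a minimal sketch rather than any particular system's loss) is the Bradley-Terry style loss, which is small when the chosen response already outscores the rejected one:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Pairwise loss for reward-model training: -log P(chosen beats
    rejected), where the win probability is sigmoid of the score gap."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

Minimizing this loss over many annotated pairs pushes the reward model to assign higher scores to the responses humans preferred.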
Seven, variational autoencoder. A VAE is a generative modeling approach that learns a probability distribution over data. A VAE consists of two neural networks, an encoder and a decoder. The encoder maps the input into a low-dimensional latent representation, while the decoder maps the latent vector back to the original input space. Training optimizes a reconstruction objective so the decoded output stays close to the original input. After training, new data can be generated by sampling a point from the latent space and decoding it. In modern text-to-image and text-to-video systems like OpenAI's Sora, a VAE is often used as a latent compressor, allowing the downstream model to operate more efficiently in a smaller space.
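The encoder/decoder structure and the sampling step can be sketched with made-up weights (the numbers and shapes here are purely illustrative; a real VAE uses trained neural networks and a KL regularizer alongside the reconstruction loss):

```python
import math
import random

def encode(x):
    """Toy encoder: map a 2-d input to the mean and log-variance
    of a 1-d latent Gaussian (weights are invented for illustration)."""
    mu = 0.5 * x[0] - 0.3 * x[1]
    log_var = 0.1 * x[0] + 0.1 * x[1]
    return mu, log_var

def reparameterize(mu, log_var, rng=random):
    # z = mu + sigma * eps, so gradients can flow through mu and sigma.
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps

def decode(z):
    """Toy decoder: map the 1-d latent back to the 2-d input space."""
    return [1.2 * z, -0.7 * z]

def generate(rng=random):
    """After training, sample from the latent prior and decode."""
    return decode(rng.gauss(0.0, 1.0))
```

The reparameterization trick in the middle is what makes the sampling step differentiable, which is why VAEs can be trained end to end.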
Eight, diffusion models. Diffusion models generate data by learning to reverse a gradual noising process. During training, you take real samples like images, add noise over many time steps, and train a model to predict the noise given the noisy input, the time step, and optional conditioning such as text. At inference time, you start from pure noise and repeatedly apply the learned denoising step to move toward a clean sample.
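The forward (noising) and reverse (denoising) directions can be sketched on a single scalar. The linear alpha schedule and the one-shot recovery formula are toy simplifications; real models use learned noise predictors over many small reverse steps:

```python
import math
import random

def add_noise(x0, t, T, rng=random):
    """Forward process: blend the clean sample with Gaussian noise.
    alpha shrinks from 1 toward 0 as timestep t grows (toy schedule)."""
    alpha = 1.0 - t / T
    noise = rng.gauss(0.0, 1.0)
    xt = math.sqrt(alpha) * x0 + math.sqrt(1.0 - alpha) * noise
    return xt, noise  # training teaches the model to predict `noise` from (xt, t)

def denoise_step(xt, predicted_noise, t, T):
    """Reverse direction: use the noise estimate to recover the clean
    sample (simplified; real samplers take many partial steps)."""
    alpha = 1.0 - t / T
    return (xt - math.sqrt(1.0 - alpha) * predicted_noise) / math.sqrt(alpha)
```

If the model's noise prediction were perfect, the reverse formula would recover the original sample exactly, which is what the training objective aims for.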
Nine, low-rank adaptation. Large models like LLMs and text-to-image systems are general purpose. They handle broad everyday tasks well, but often struggle in specialized domains. LoRA is an efficient fine-tuning method that adapts a pre-trained model without updating all of its parameters. It keeps the original linear layer weights frozen and adds two small low-rank trainable matrices, so the model can learn domain-specific adjustments with far fewer new parameters.
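The forward pass of a LoRA-adapted linear layer can be sketched with plain-list matrix multiplies (shapes and the `scale` factor are illustrative; real implementations run this on tensors inside the model):

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, scale=1.0):
    """LoRA layer output: x @ W plus the low-rank correction
    x @ A @ B. Only A (d_in x r) and B (r x d_out) are trained;
    the pretrained weight W (d_in x d_out) stays frozen."""
    base = matmul(x, W)               # frozen pretrained path
    update = matmul(matmul(x, A), B)  # low-rank trainable path
    return [[b + scale * u for b, u in zip(base_row, up_row)]
            for base_row, up_row in zip(base, update)]
```

The parameter saving comes from the shapes: A and B together hold d_in*r + r*d_out values, which for a small rank r is far fewer than the d_in*d_out values of W itself.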
With this foundation, you should find reading future AI designs and articles much easier.