
9 AI Concepts Explained in 7 minutes: AI Agents, RAGs, Tokenization, RLHF, Diffusion, LoRA...

By ByteByteAI

Summary

## Key takeaways

- **Tokenization converts text into integers**: Neural networks like LLMs cannot work with raw text directly; a tokenizer breaks text into smaller units called tokens and maps each token to an integer ID so the model can process it. [00:10], [00:18]
- **Greedy decoding lacks creativity**: Greedy decoding always picks the most likely next token and works well for deterministic tasks, but not for tasks requiring creativity, where sampling-based methods add controlled randomness. [01:08], [01:18]
- **LLM alone cannot take actions**: An LLM on its own only generates text and cannot take actions like browsing the web, checking the weather, or running code; multi-step agents wrap an LLM in a loop with access to tools and memory to plan and execute tasks. [02:25], [02:38]
- **RAG grounds answers in external evidence**: A plain LLM answers using only what is stored in its weights, so it can be wrong or outdated on recent events; RAG pairs an LLM with a retrieval system connected to a knowledge store so it pulls relevant passages to write grounded answers. [02:57], [03:18]
- **RLHF made ChatGPT succeed**: The initial launch of ChatGPT succeeded largely because of the RLHF stage, which uses a reward model trained from human preference pairs to push the model toward outputs that people rate as more helpful, clear, and safe. [03:30], [03:47]
- **LoRA fine-tunes without full retraining**: LoRA keeps original linear layer weights frozen and adds two small low-rank trainable matrices, allowing a model to learn domain-specific adjustments with far fewer new parameters than full fine-tuning. [06:14], [06:24]

Topics Covered

  • Tokenizers Use BPE Merging
  • Top-P Sampling Boosts Diversity
  • Chain-of-Thought Enables Multi-Step Logic
  • RAG Grounds Answers in Evidence
  • LoRA Adapts Giants Efficiently

Full Transcript

Most modern AI products are built from the same set of core ideas. In the next seven minutes, I'll walk through nine concepts you will see repeatedly across real-world AI systems.

One, tokenization. Neural networks like LLMs cannot work with raw text directly. A tokenizer breaks text into smaller units called tokens and maps each token to an integer ID, so the model can take the sequence as input instead of raw text.

The most common algorithm is byte pair encoding, or BPE. BPE starts from small units, often bytes or characters, and repeatedly merges the most frequent adjacent pairs to form new tokens. Over time, common fragments like "ing" or "tion" become single tokens, so words like "walking" might be split as "walk" plus "ing".
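The merge loop described above can be sketched in a few lines of Python. This is a toy illustration of the BPE idea applied to a single string, not a production tokenizer, which would count pairs over a whole corpus and record the merge rules:

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most
    frequent adjacent pair into a single new token."""
    tokens = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # fuse the frequent pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_merges("aaab", 1))  # ['aa', 'a', 'b']
```

After enough merges on real text, frequent fragments survive as single tokens while rare words stay split into pieces.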

Two, text decoding. An LLM simply outputs a probability distribution over the vocabulary for the next token. A decoding algorithm chooses one token from that distribution, appends it to the sequence, and repeats the process to produce a full response.

The simplest text decoding approach is greedy decoding, which always picks the most likely next token. It can work well for deterministic tasks, but not for tasks requiring creativity. Sampling-based methods add controlled randomness to improve diversity. For example, top-p sampling restricts the choice to the smallest set of tokens whose probabilities sum to at least p, then samples from that set.
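Top-p (nucleus) sampling as described above can be sketched like this. The probability values are made up for illustration; a real decoder would take them from the model's output distribution:

```python
import numpy as np

def top_p_sample(probs, p, rng):
    """Sample a token id from the smallest set of tokens whose
    cumulative probability reaches p (nucleus sampling)."""
    order = np.argsort(probs)[::-1]              # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest set reaching p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()       # renormalize inside the set
    return rng.choice(keep, p=kept)

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
token = top_p_sample(probs, p=0.8, rng=rng)
```

With p = 0.8 here, only the top few tokens can ever be picked, so very unlikely tokens are cut off while some diversity remains.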

Three, prompt engineering. Vague prompts usually lead to vague answers. Prompt engineering is the practice of shaping instructions and context to steer a model's behavior without changing its weights. A strong prompt clearly states the task, key constraints, and expected output format.

One common technique is few-shot prompting, where you include a handful of examples so the model imitates the desired style and structure. Another is chain-of-thought prompting, in which you ask for step-by-step reasoning. CoT prompting can improve performance on problems that require multi-step logic, like math and coding. Prompt engineering is widely used because it is fast to iterate on and inexpensive compared to training or fine-tuning a model.
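A few-shot prompt along the lines described above might look like this. The task and example reviews are invented for illustration; the point is the structure, where worked examples precede the real query:

```python
# A minimal few-shot prompt: the examples show the model the desired
# style and structure before the final, unanswered query.
prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day.
Sentiment: positive

Review: The screen cracked within a week.
Sentiment: negative

Review: Setup was quick and painless.
Sentiment:"""
```

The model then continues the pattern, answering the last query in the same format as the examples.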

Four, multi-step AI agents. An LLM on its own only generates text. It cannot take actions like browsing the web, checking the weather, or running code. Multi-step agents wrap an LLM in a loop with access to tools and memory, so it can plan what to do next, call external tools, and use the results to decide the next step. The agent repeats this cycle until it reaches the goal, runs out of budget, or determines it cannot make further progress.
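The plan, act, observe loop can be sketched as below. `call_llm` and `get_weather` are hypothetical stand-ins, stubbed here so the loop logic is runnable; a real agent would call an actual LLM API and real tools:

```python
def call_llm(history):
    # Stand-in for an LLM call: a real agent would send the history
    # to a model and parse its reply into an action.
    if not any(step[0] == "tool_result" for step in history):
        return ("use_tool", "get_weather", "Paris")
    return ("final_answer", "It is sunny in Paris.")

def get_weather(city):
    return f"sunny in {city}"  # stub tool

TOOLS = {"get_weather": get_weather}

def run_agent(goal, max_steps=5):
    history = [("goal", goal)]          # memory of the episode
    for _ in range(max_steps):          # step budget
        action = call_llm(history)      # plan the next step
        if action[0] == "final_answer":
            return action[1]            # goal reached
        _, tool_name, arg = action
        result = TOOLS[tool_name](arg)  # call an external tool
        history.append(("tool_result", result))
    return "budget exhausted"

print(run_agent("What's the weather in Paris?"))
```

Each iteration feeds tool results back into the model's context, which is what lets the agent decide its next step from real observations rather than guesses.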

Five, retrieval-augmented generation. A plain LLM answers using only what is stored in its weights, so it can be wrong or outdated on recent events or changing company policies. RAG pairs an LLM with a retrieval system connected to a knowledge store. When you ask a question, the retriever first pulls relevant passages from sources like PDFs, docs, or a database. Then the LLM uses those passages to write the answer. This grounds the response in external evidence instead of relying only on the model's memory.
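A toy retrieve-then-generate pipeline gives the flavor. The passages, the word-overlap scoring, and the stubbed generation step are all simplifications; real systems retrieve with embeddings and pass the context to an LLM:

```python
PASSAGES = [
    "The refund window is 30 days from the date of purchase.",
    "Support is available by email 24/7.",
    "Shipping within the EU takes 3-5 business days.",
]

def retrieve(question, passages, k=1):
    """Rank passages by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(passages,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(question):
    context = retrieve(question, PASSAGES)
    # A real RAG system would prompt an LLM with the retrieved context.
    return f"Based on: {context[0]}"

print(answer("How many days is the refund window?"))
```

Because the answer is built from a retrieved passage, it stays current when the knowledge store is updated, without retraining the model.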

Six, reinforcement learning from human feedback. The initial launch of ChatGPT succeeded in large part because of the RLHF stage. RLHF is a reinforcement learning approach where the model practices by generating multiple candidate responses. A separate reward model scores them, and the training algorithm updates the model's weights so higher-scoring responses become more likely over time. This pushes the model toward outputs that people consistently rate as more helpful, clear, and safe, not just outputs that are statistically likely.

RLHF aligns an LLM with human preferences mainly because of how the reward model is trained. The reward model learns directly from human feedback, usually from pairs of model responses to the same prompt where annotators pick the one they prefer. By learning these preference patterns, the reward model becomes a proxy for what humans tend to want, and reinforcement learning uses that signal to steer the LLM toward responses that score higher on that proxy.
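The reward model's training on preference pairs typically uses a pairwise loss of the kind sketched below. The scores here are illustrative numbers, not real model outputs:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style pairwise loss: -log(sigmoid(chosen - rejected)).
    Small when the preferred response scores higher, large otherwise."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.0))  # small loss: preference respected
print(preference_loss(0.0, 2.0))  # large loss: preference violated
```

Minimizing this loss over many annotated pairs is what teaches the reward model to score responses the way human annotators would.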

Seven, variational autoencoder. A VAE is a generative modeling approach that learns a probability distribution of data. A VAE consists of two neural networks, an encoder and a decoder. The encoder maps the input into a low-dimensional latent representation, while the decoder maps the latent vector back to the original input space. Training optimizes a reconstruction objective so the decoded output stays close to the original input.

After training, new data can be generated by sampling a point from the latent space and decoding it. In modern text-to-image and text-to-video systems like OpenAI's Sora, a VAE is often used as a latent compressor, allowing the downstream model to operate more efficiently in a smaller space.
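The generation step, sample a latent point and decode it, can be illustrated with a toy decoder. The random linear map here stands in for a trained decoder network; only the shapes and the sampling pattern carry over to a real VAE:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 4

# Toy "decoder": a fixed linear map from the 2-D latent space back to
# a 4-D data space. In a real VAE this is a trained neural network.
W = rng.normal(size=(data_dim, latent_dim))
b = rng.normal(size=data_dim)

def decode(z):
    return W @ z + b  # latent vector -> data space

z = rng.normal(size=latent_dim)  # sample a point from the latent prior
new_sample = decode(z)           # a "generated" 4-dimensional sample
print(new_sample.shape)
```

Note the compression: two latent numbers expand into four data dimensions, which is why downstream models that work in the latent space are cheaper to run.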

Eight, diffusion models. Diffusion models generate data by learning to reverse a gradual noising process. During training, you take real samples like images, add noise over many time steps, and train a model to predict the noise given the noisy input, the time step, and optional conditioning such as text. At inference time, you start from pure noise and repeatedly apply the learned denoising step to move toward a clean sample.
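The forward (noising) side of training can be sketched with the commonly used closed form x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise. The linear schedule and the toy sample below are illustrative values, not tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal fraction

def add_noise(x0, t, noise):
    """Jump straight to noise level t of the forward process."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = np.ones(4)                              # toy "clean" sample
noise = rng.normal(size=4)
x_mid = add_noise(x0, t=500, noise=noise)    # partially noised
x_end = add_noise(x0, t=T - 1, noise=noise)  # nearly pure noise
# During training, a network sees (x_t, t) and learns to predict `noise`.
```

Because alpha_bar shrinks toward zero as t grows, late time steps are almost pure noise, which is exactly the starting point inference reverses from.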

Nine, low-rank adaptation. Large models like LLMs and text-to-image systems are general purpose. They handle broad everyday tasks well, but often struggle in specialized domains. LoRA is an efficient fine-tuning method that adapts a pre-trained model without updating all of its parameters. It keeps the original linear layer weights frozen and adds two small low-rank trainable matrices, so the model can learn domain-specific adjustments with far fewer new parameters.
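The core trick can be shown with plain arrays. The frozen weight W is augmented with a low-rank update B @ A, so only A and B are trained. The layer sizes and rank below are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 512, 512, 8

W = rng.normal(size=(d_out, d_in))     # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # zero init: no change at start

def lora_forward(x):
    return W @ x + B @ (A @ x)         # frozen path + low-rank adapter

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(lora_params / full_params)       # fraction that is trainable
```

With rank 8 on a 512x512 layer, the adapter holds about 3% of the layer's parameters, which is why LoRA checkpoints are tiny compared to full fine-tunes.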

With this foundation, you should find reading future AI designs and articles much easier.
