RAG Crash Course for Beginners
By KodeKloud
Summary
## Key takeaways

- **RAG Fixes LLM Hallucinations**: LLMs like ChatGPT hallucinate incorrect answers about private company policies because they lack specific context. RAG solves this by retrieving relevant document sections, augmenting the prompt, and generating accurate responses. [01:16], [01:43]
- **Not All Problems Need RAG**: RAG is not the solution for everything; use prompt engineering for restrictions and security, and fine-tuning for a stable voice and style like a CEO's Scottish tone. Fine-tuning fails for dynamic policies due to retraining costs, lack of citations, and knowledge cutoff. [03:27], [07:21]
- **Keyword Search Fails on Synonyms**: Keyword search using TF-IDF or BM25 fails when exact terms are not matched, such as 'allowance' instead of 'reimbursement' or 'work from home' instead of 'home office'. Semantic search overcomes this by understanding meaning through embeddings. [15:05], [15:44]
- **Embeddings Capture Semantic Similarity**: Embedding models like all-MiniLM-L6-v2 convert text to 384-dimensional vectors where similar meanings cluster closely; dogs vs pets scores 73.3% similarity while dogs vs remote work is only 36.2%. This enables finding relevant content without exact word matches. [24:03], [24:44]
- **Vector DBs Scale Retrieval**: Vector databases like Chroma use HNSW indexing to group similar vectors into neighborhoods, avoiding exhaustive 192,000 calculations per query on 500 documents. They retrieve precise results instantly, like a smart librarian. [30:52], [31:52]
- **Chunking Balances Context and Precision**: Without chunking, large 50-page handbooks return irrelevant entire documents; fixed-size chunks of 200-500 characters with 50-100 overlap preserve meaning and deliver focused sections. Too small loses context; too large overwhelms with irrelevance. [39:07], [41:24]
Topics Covered
- RAG Fixes LLM Hallucinations
- Use Fine-Tuning for Style, RAG for Facts
- Semantic Search Beats Keyword Matching
- Chunking Balances Context and Precision
- Cache Embeddings, Not Just Responses
Full Transcript
Everyone's talking about RAG. If you feel left out, this is the only video you need to watch to catch up. In this video, we'll learn RAG in a super simplified manner, with visualizations that will make it easy for anyone to understand. No background knowledge in AI, AI models, coding, or programming is required. We'll start with the simplest explanation of RAG there is. Then we'll look into when to and when not to use RAG. We'll then look into what RAG is. We'll then understand some of the prerequisites, such as keyword search versus semantic search, embedding models, vector DBs, and chunking, using a simple use case, and finally bring all of that together into the RAG architecture. We'll then look into caching, monitoring, and error handling techniques, and close with a brief setup of deploying RAG in production. But that's not all. This is not just a theory course. We have hands-on labs after each lecture that will help you practice what you learned. Our labs open up instantly right in the browser, so there is no need to spend time setting up an environment. These labs are staged with challenges that will help you think and learn by doing, and they come absolutely free with this course. I'll let you know how to go about the labs when we hit our first lab session. For now, let's start with the first topic.
Let's start with the simplest explanation of RAG. Say you were to ask ChatGPT what's the reimbursement policy for home office setup. You already know when you ask this question that ChatGPT is going to give an incorrect answer, because it doesn't have access to our policy document that's private to our company. So an LLM like GPT would hallucinate and provide an incorrect or generic answer that's common to most companies. The problem here is that it doesn't have the necessary context for what you're asking about. So what do you do? You look up your internal policy document and find the section that describes home office setup yourself. Then you add that to your prompt and tell ChatGPT to refer to this policy. Now, with this additional information, ChatGPT is able to generate more accurate responses. And that is the simplest explanation of RAG, which stands for retrieval-augmented generation. The part where you look up your internal policy documents and retrieve the relevant information is known as retrieval. The part where you improve or augment your prompt with the retrieved information is known as augmentation. And the part where the LLM generates a response based on the augmented prompt is known as generation. That is something you've done unknowingly many times. Now, of course, that is a very simplified explanation of RAG, and when we talk about RAG systems, that is not what we typically refer to.
So let's see what that is next. Now, you don't want your users to have to locate and retrieve relevant information by themselves. Instead, you want your users to simply ask the question: what's the reimbursement policy for home office setup? And our RAG-based system should be able to do the lookup and retrieval of relevant information, improve or augment the user's prompt, and get an LLM to generate the right response. Now, how exactly do we retrieve relevant information? How do we augment, and how do we generate? That's what we're going to discuss throughout the rest of this video. Now,
one of the common mistakes people make is to consider RAG as the solution for everything. RAG is not the solution to all problems. At the end of the day, we're all trying to get AI to generate better responses, and there are different ways to do that. We can prompt better; that's called prompt engineering. We can fine-tune models. And then there's RAG. When to use what? Let's take a simple use case to understand these better. So,
back to our use case. We've started to notice a lot of people copy-pasting company policies into ChatGPT to get answers. So we decided to build an internal chatbot that can answer people's questions. We call it the Policy Copilot. It is a system where users can simply ask a question, such as what's the reimbursement policy, and our chatbot should be able to locate the necessary information from the internal policy documents, generate accurate responses, and send them back to the user. Now, we also want to add some restrictions and limitations. We don't want the chatbot to answer everything. Some questions should be off limits, like performance review appeals or salary discussions. And when those topics come up, we want to direct users to HR directly instead of giving them answers in the chat. We also want our chatbot to have a specific voice and style. Our CEO has this warm Scottish accent and a particular way of speaking that makes people feel a certain way. We want our Policy Copilot to sound just like that: authoritative and distinctly Scottish. So when the users ask what's the reimbursement policy for home office setup, it responds... When the users ask how many sick days do I get per year, it says... When the user asks can I work from home permanently, it says... And when the users ask when are performance reviews conducted, it responds... As you can see, it's not just the Scottish accent; there's this, what should I say, refreshing candor that tells it like it is. Let's look at how to solve each of these areas. The restrictions and security require us to define how the chatbot responds: what it must reveal and
what it must not. These are strict instructions provided to the LLM to control its behavior based on user requests, such as never to reveal personal employee information or confidential details, and if someone asks about sensitive topics, to politely redirect them to HR. Prompt engineering best practices are a good solution to this. Think of it as the rule book that keeps our chatbot safe and professional. Next, we look at how to solve the problem of voice, style, and language. Now, we know that if we asked ChatGPT to simply respond in a Scottish accent, it would. But the accent, as we saw earlier, is not all we are going after here. We want it to speak like our Scottish CEO: the words he usually uses, the tone, the language. So, we take all of his past speeches, emails written by him, blog posts, and videos created, and fine-tune a new model that can respond in the same language and tone. A good solution for this is fine-tuning.
Fine-tuning is the process where you provide a model with hundreds of sample questions and sample answers and have it respond to you in that way all the time. Now, you might be wondering: why can't fine-tuning solve this information problem? Why can't we train a model with all of the questions a user might ask and the answers it should generate? The problems are that policies can change constantly, and when they do, you need to retrain the model every time, and trainings are not easy. They're expensive and slow; retraining takes time and computational resources. Users can't verify where the answers came from, so no citations are possible. The larger the training data, the lower the accuracy. And then there's knowledge cutoff: the model only knows what was in the training data. Fine-tuning is great for stable, unchanging patterns like communication style, but terrible for dynamic factual information. And finally, the best solution to get the most accurate responses is RAG. RAG works because it retrieves information dynamically at query time, not at training time; the whole point of RAG is retrieving the most relevant information for the user's query in real time. Next, we'll look at RAG in more detail.
Let's now look at what RAG is in the first place. So far, we've decided that we're going to build our Policy Copilot system, where employees can ask questions and it retrieves the relevant information, augments prompts, and generates a response. We'll now see how each of these works. Let's look at retrieval first. Retrieval is the process of retrieving relevant information. But how do you do that? There may be hundreds of policy documents. How do you find the right one that has context related to the user's question? And what do you search for within these files? First, we identify a few keywords from the user's question. In this case, we've identified reimbursement and home office to be the relevant keywords. One of the simplest ways is to use a grep command to search for these specific terms in the files and hope that one of them will contain these terms. Alternatively, if these files were stored in a database, you could run a query against it. Now, these would only return content that exactly matches the keywords we are looking for, and the chances of getting accurate information every time are low. This approach of searching the documents for the exact words is known as keyword search, and it is a very popular technique used by many search platforms. To explain it simply, this approach goes through all the documents, identifies keywords, and ranks the documents based on keyword frequency. In this case, it counts the occurrences of reimbursement in all documents and records them. So we have three occurrences in the first document, none in the middle two, but another three in the last one. It then does the same for home office, and we see that it's only present in the home office setup document. Combining these two columns, it is now able to identify the document with the maximum occurrences of these two keywords, and thus rightly select the right document.
Now, that was a super simplified explanation. Keyword search is a science in itself, with a lot of complex calculations behind it, and there are multiple proven approaches available. Two of the most popular techniques are known as TF-IDF and BM25. We won't go into the specifics of how these work; we'll just see how to work with them. Let's see each of these in action.
First, we import the TfidfVectorizer from the scikit-learn open-source Python library. Think of scikit-learn as a toolbox with pre-built algorithms that you can use without having to write them from scratch. We then define three sample documents. The documents are simple sentences for now; you could read in the contents of a file instead. We then create the TF-IDF analyzer and call it analyzer. The word scores can then be calculated by running the fit_transform method, and we print the results on screen. The word scores show a two-dimensional array with the importance of each word in each sentence. The word office appears in all sentences, so it gets a score of 0.4. The first sentence identifies the words equipment and policy and gives them scores of 0.7 and 0.5. The second sentence identifies the words furniture and guidelines, and the third identifies the words travel and policy. Now that the vectors are created, we run a query. We use the analyzer's transform method on the query word furniture. It returns an array of scores that compare the query word furniture to each document.
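The flow being described (vectorize the documents, then transform a query and compare) can be sketched with scikit-learn. The exact on-screen sentences and variable names aren't in the transcript, so the documents below are stand-ins chosen to mirror the words mentioned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three sample documents, stand-ins for the sentences shown in the video.
docs = [
    "office equipment policy for employees",
    "office furniture guidelines",
    "office travel policy",
]

# Build the TF-IDF analyzer and compute the score matrix:
# one row per document, one column per word.
analyzer = TfidfVectorizer()
word_scores = analyzer.fit_transform(docs)

# Transform the query into the same vector space and compare it
# against every document with cosine similarity.
query_vec = analyzer.transform(["furniture"])
similarities = cosine_similarity(query_vec, word_scores)[0]

best = similarities.argmax()
print(docs[best])  # the furniture document scores highest
```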
Now, let's see the same with the BM25 technique.
We use the rank_bm25 library, which is a popular library that implements the BM25 algorithm. We then create what is known as the BM25 index and get the word scores. In this case, we can see some differences. The word office gets a score of zero, because the BM25 algorithm is a bit more strict in assigning scores: since this word is present in all documents, it doesn't consider it very relevant. It then assigns scores for the most important and unique words in each sentence, like equipment in the first sentence, furniture and guidelines in the second, and travel in the third. And as before, we run a query, but this time using the get_scores method, and print the array.
We can see it has again rightly identified the second document as the relevant one. Well, it's time to gain some hands-on practice on what we just learned.
Follow the link in the description below to gain free access to the labs associated with this course. Create a free account and click on enroll to start the labs. On the left side of the screen, you will see the list of labs. Only start a lab when I ask you to; we'll do only one lab at a time. Let's start with the first lab. Click on start to launch the lab and give it a few seconds to load. Once loaded, familiarize yourself with the lab environment. On the left-hand side, you have a questions portal that gives you the tasks to do. On the right-hand side, you have a VS Code editor and a terminal to the system. Remember that this lab gives you access to a real Linux system. Click on OK to proceed to the first task. The first task requires you to explore the document collection. Open the TechCorp documents in the VS Code editor. On the right, we see there is a TechCorp docs folder. Expand it to reveal the subfolders. The ask is to count how many documents are in the employee handbook. This is what I call a warm-up question that will help you explore and familiarize yourself with the lab; the real tasks are coming up. In this case, it's three, so I select three as the answer. Then proceed to the next task.
This is about performing a basic grep search. As we discussed in the lecture, we'll run a grep command to search for anything related to holiday in the folder and store the results in a file named extracted content. To open the terminal, click anywhere in the panel below and select terminal. Running the command creates a new file with the results. Click check to verify your work and continue to the next task. The next task is to set up a Python virtual environment and install dependencies. I'll let you do that yourself.
We'll move to the next task now. Here we explore the TF-IDF script. We first import the TfidfVectorizer from the scikit-learn library. We then transform the docs. Then we compare using cosine similarity, which is one approach to comparing two vectors to identify how similar they are. And then we finally print the results. We now execute the script and view the results, and for now we'll just click check to proceed to the next step. Here the question is to analyze the scores printed and identify the score of the top result. So the ask is to search for pet policy docs and identify the score for the top result. Here we see the top result is rightly identified as the pet policy.md file with a score of 0.4676, whereas the other files have scores less than 0.1. So the answer to this question is 0.4676.
The next task is to review and execute the BM25 script. Open the BM25 search.py file and inspect it. You'll see that we import the rank_bm25 package. We then create an index and, for each query, call the BM25 get_scores method. From the results, we take the top three and print each one. Finally, there is a hybrid approach that combines the TF-IDF and BM25 techniques using weighted scores. I'll let you explore that by yourself. Let's get back to the next topic. We just looked at keyword search. Let's now understand semantic search.
Now, one of the challenges with keyword search is that if the exact keyword isn't there, the search fails. For example, instead of reimbursement, if we say allowance, it tries to find the exact word allowance. And instead of home office, if the user asks work from home, it's unable to find that anywhere. These combinations of keywords aren't found in the documents, and thus the document is not found. In our example code, if we say desk instead of furniture, it's not going to find any matches in the scores, and thus is unable to find any matching document. That's the limitation of keyword search, and that's where we need semantic search. Semantic search searches documents based on the meaning of words and thus has a higher chance of locating the right documents based on the inputs. That's what we will look at next. So what exactly is semantic search? Think of it as search that understands meaning, not just words. When you search for allowance, semantic search can find documents about allowances or reimbursements or anything with a similar meaning, even if those exact words aren't used. Similarly, if you search for home office or work from home, it can find documents that have anything to do with remote work. The magic happens through something called embeddings. We convert both your search query and all the documents into mathematical vectors. Think of them as coordinates in a high-dimensional space. Documents with similar meanings end up close together in this space. So, when you search, we find the closest matches based on meaning, not just word overlap. We can measure how similar two pieces of text are by calculating the distance between their vectors: the closer the vectors, the more similar the meaning. So reimbursement and allowance would have vectors that are close together even though they're different words. We'll see this in more detail next.
Let's now understand embedding models. If you look at machine learning models, they can be categorized at a high level based on use case, such as computer vision, NLP (natural language processing), and audio, among many others. And within each category, you have a number of models available. This is as shown on Hugging Face, which is a popular platform where you can discover models, datasets, and applications. Our interest here is sentence similarity within natural language processing. And within sentence similarity, one of the popular models is sentence-transformers all-MiniLM-L6-v2. This model maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for clustering and semantic search. It is also a 22 million parameter model. Now, what does that mean? The parameter size reflects the brain power of the model. Think of parameters as the learned knowledge stored in the AI's memory. Each parameter is a number that the model learned during training to understand language patterns. 22 million parameters means this model has 22 million learned values that help it understand how words relate to each other, what sentences mean semantically, and which concepts are similar or different. Let's compare that to things we already know, like GPT models. The 22 million parameter size of our all-MiniLM model is very small compared to the 175 billion parameters of GPT-3.5 and the 1.8 trillion parameters of GPT-4. The size of the model is proportional to that too: the all-MiniLM model is 90 megabytes in size, so it can run locally on our laptops, while GPT-3.5 and GPT-4 are 350 GB and 3.6 TB respectively. And thus the use case differs: the all-MiniLM model is a perfect fit as an embedding model for our use case, whereas the GPT models are used for text generation and reasoning.
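Before unpacking what embeddings actually are, a toy numeric picture helps show what "similar meanings end up close together" looks like. The three-dimensional vectors below are invented purely for illustration; a real model like all-MiniLM-L6-v2 produces 384 dimensions.

```python
import math

# Invented 3-D "embeddings"; the dimensions loosely mean
# [money-related, workplace-related, travel-related].
vectors = {
    "reimbursement": [0.9, 0.4, 0.1],
    "allowance":     [0.8, 0.3, 0.2],
    "vacation":      [0.1, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "allowance" lands much closer to "reimbursement" than "vacation" does,
# even though the words share no characters.
print(cosine(vectors["reimbursement"], vectors["allowance"]))
print(cosine(vectors["reimbursement"], vectors["vacation"]))
```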
So we just mentioned embeddings. What are they, actually? In its simplest form, an embedding model takes text and converts it into numbers that represent meaning. So a sentence like 'dogs are allowed in the office' is converted into an array of numbers known as a vector. When you give the model a sentence like that, it doesn't just look at the words. Instead, it thinks about what the sentence actually means. Is it about animals? Is it about workplace policies? Is it about permissions? The model then creates a list of numbers that captures all these different aspects of meaning. Each number represents something the model learned about language. Maybe the first number captures how animal-related the text is, the second captures how workplace-related it is, and so on. It then plots that on a graph. So dogs gets a number 0.00005597 and is added to a section of the graph that represents animals, and pets also falls into the same category. However, remote does not go there. Similarly, office falls into the workplace area. So, our first sentence moves closer to the workplace section, and so does the second sentence, because it is also related to work. And the same applies to the last sentence, as that's also related to the workplace. We then compute the distance between these points: the shorter the distance, the closer the match. So, finally, if you look at these sentences, you'll see that the first two are similar. That's similarity search explained in the simplest of forms. And this explanation only works for a two-dimensional array, but in most cases there are too many dimensions for us to even imagine how it would look visually. The model we are using has 384 dimensions, so we can't plot that on a graph. How, then, do we calculate similarities between these vectors?
This is where the magic of mathematics comes in. Since we can't visualize 384 dimensions, we need a mathematical way to measure how close two points are in this high-dimensional space. The solution is something called the dot product. Think of it as a mathematical ruler that can measure distance in any number of dimensions, even ones we can't see. Here's how it works in simple terms. For the sake of simplicity, I'll convert the vectors for each sentence into two-dimensional vectors of simple numbers. So 'dogs are allowed in the office' gets a vector value of (1, 5), the second sentence gets (2, 4), and the third one gets (6, 1). The process involves multiplying the vectors, adding the products, and then normalizing the result. Let's look at the first two. We first multiply the values in the vectors: 1 × 2 to get 2, and 5 × 4 to get 20. We do the same for the other two pairs: we multiply (1, 5) with (6, 1) to get 6 and 5, and we multiply (2, 4) with (6, 1) to get 12 and 4. We then add the multiplied numbers together: 2 + 20 gives us 22, and we get 11 and 16 for the others. Finally, these go through a normalization process that converts the numbers into values between 0 and 1, taking into consideration the total size of the vectors, among other things. The pair with the value closest to one is similar; pairs far from one are dissimilar. So that's a basic explanation of how sentences are compared for similarity.
Now, of course, you don't have to do all of that math yourself. We have libraries that do it for you. NumPy is a powerful Python library for working with numbers and mathematical operations. We import numpy as np and then call the np.dot method, passing in the two vectors for it to calculate the dot product between them. It returns a similarity score between zero and one.
So let's take a closer look at that. First, we install the required libraries: sentence-transformers and numpy. The sentence-transformers package, as we saw, provides the SentenceTransformer class and the all-MiniLM model. NumPy provides the np.dot function for calculating dot products between vectors. Here we can see the complete code in action. We first import the sentence-transformers and numpy libraries. Then we load the all-MiniLM-L6-v2 model that we've been discussing. This downloads the model, loads the 22 million parameters into memory, and prepares the model to convert text into embeddings. We then define our three test sentences about dogs, pets, and remote work, and encode these sentences into embeddings using the embedding model. And finally, we calculate the similarity between each pair of sentences using NumPy's dot product function. Now, let's see what happens when we run this code. We print out the similarity scores between each pair of sentences, and the results are quite interesting. Dogs versus pets shows 73.3% similarity. That makes sense, because both are talking about animals in the workplace. Dogs versus remote shows only 36.2% similarity. That makes sense too, because one is about animals and the other is about work arrangements. Pets versus remote shows 33.8% similarity; again, these are quite different topics. This demonstrates exactly what we've been talking about: the model can understand semantic meaning, not just word matching. Even though dogs and pets are different words, the model recognizes they're both about animals in a workplace context. And it correctly identifies that remote work policies are quite different from animal policies. This is the foundation of how RAG systems are built: they can find semantically similar content even when the exact words don't match, and it's what makes RAG so powerful compared to traditional keyword search. So far we've been looking at sentence transformers and the all-MiniLM-L6-v2 model. But sentence transformers are just one example of embedding models.
There are many other popular embedding models out there that you can choose from depending on your use case. Now, let me clarify an important distinction. The sentence-transformers models we've been using are local models: they run on your local machine, they're completely free, and they don't require an internet connection. But there are also remote or API models, like OpenAI's embeddings, that run on external servers, where you pay per use and need an internet connection. In this sample code, you can see how we use the OpenAI library and the embeddings API endpoint to create a new embedding. The model is text-embedding-3-small, and it returns the embedding vector. There's also a leaderboard of top embedding models posted by Hugging Face. We can see some of the most popular ones here: Gemini topping the chart, with Qwen3 and others following. Well, that's all for now. Head over to the labs and practice working with embedding models.
right, we're now going to look into the second lab. This one is called embedding models. I'm just going to click on start lab to start the lab, and we'll give it a few minutes to load. Okay, so in this lab we're going to look at embedding models. We'll explore semantic search using embedding models, which is the foundation of modern RAG systems. Let's go to the first task, which is about keyword search limitations. First we navigate to the project, create a new virtual environment, and install the requirements. I go to the terminal and set up the virtual environment. Our project is within this folder called rag project, and here we have the virtual environment being set up.
Okay, once the virtual environment is set up, the next step is to run the keyword limitation demo. If you go to the rag project, you'll see the keyword limitation demo script. This is a simple script that searches for a keyword that does not exist in the documents and shows that pure keyword-based search is less likely to yield the right results. For example, in this case, the query is "distributed workforce policies", and none of the documents have something that matches that exactly. So, let's try running the script. If you look at the output, most of the scores are zero because the keywords "distributed workforce policies" do not appear in any of the documents. So, the correct answer here is missing synonyms and context.
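To see why this happens, here's a minimal sketch (not the lab's actual script) of keyword-overlap scoring, a crude stand-in for TF-IDF/BM25. The documents and query below are illustrative; a query phrased with synonyms scores zero against every document:

```python
# Minimal keyword-overlap scorer: counts how many query words
# appear verbatim in each document.
def keyword_score(query: str, doc: str) -> int:
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words)

docs = [
    "Remote work policy: employees may work from home three days a week.",
    "Reimbursement policy for home office equipment purchases.",
]

# The query uses synonyms ("distributed workforce") that never
# appear verbatim, so every keyword score is zero.
for doc in docs:
    print(keyword_score("distributed workforce policies", doc))
```

This is exactly the failure mode the lab demonstrates: no shared words, no matches, even though the documents are clearly relevant.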
All right.
The next task is to install the embedding dependencies. We go to the rag project, and we're already in that project. We source the virtual environment and install the embedding packages. I'm going to copy this command and run it. The packages are sentence-transformers, huggingface-hub, and openai.
The next task is to run the local embedding script. The script name is semantic search demo, so let's look at it. The first step is loading the documents. Then we load the local embedding model, which is all-MiniLM-L6-v2, and generate embeddings by calling the model's encode method, passing in the docs. Then we have the query, which is the same query we used before: distributed workforce policies. We generate an embedding for the query, calculate the similarities using NumPy, and print the results. So let's run the script: we type uv run python on the semantic search demo script.
Now, as you can see, on the same set of documents the script has identified the relevant documents whose meaning is closest to the distributed workforce policies query we are looking for. Each document is given a similarity score, which means the script is able to identify the document with the closest semantic match. We'll go to the next question.
The next task is to look at the semantic search results and find the similarity score between the remote work policy document and the distributed workforce policies query. If you look at the first score, it's 0.3982, and that is the score for the remote work policy.
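Under the hood, scores like that 0.3982 are cosine similarities between the query vector and each document vector. A minimal sketch with toy 3-dimensional vectors (real models like all-MiniLM-L6-v2 use 384 dimensions; these numbers are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of a query and two documents.
query = [1.0, 2.0, 0.5]
doc_similar = [1.1, 1.9, 0.4]     # points in nearly the same direction
doc_unrelated = [-2.0, 0.1, 3.0]  # points elsewhere

print(cosine_similarity(query, doc_similar))    # close to 1.0
print(cosine_similarity(query, doc_unrelated))  # much lower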
The next question is a multiple choice question that confirms our learning. It's based on the comparison between semantic search and keyword search (TF-IDF and BM25) that we saw earlier: which approach better understands the meaning of queries? Of course, we know that semantic search understands the meaning of queries better. And that's basically it for this lab. In the next one, we'll explore vector databases.
Let's now understand vector databases.
So far, we saw how we could use the sentence-transformers library, load simple sentences into it to create embeddings, and then compare those embeddings to each other in a very simple way. However, we have a bigger task at hand: our policy copilot system has hundreds or thousands of large policy documents. Let's say we have 500 policy documents, each with multiple sections. When a user asks, "What's the reimbursement policy for home office setup?", our system needs to search through all of these documents to find the most relevant ones. Now, if we were to do this the naive way, comparing the query embedding with every single stored embedding, we'd have a big problem, because with 500 documents, each with 384 dimensions, that's 192,000 calculations for every single query.
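Here's a sketch of what that naive approach actually does: a brute-force scan where every query touches all 500 × 384 = 192,000 stored numbers. Random vectors stand in for real embeddings, and dot products stand in for the similarity calculation:

```python
import random

random.seed(0)
DIM, NUM_DOCS = 384, 500

# Random vectors stand in for real document embeddings.
stored = [[random.random() for _ in range(DIM)] for _ in range(NUM_DOCS)]
query = [random.random() for _ in range(DIM)]

# Brute-force search: a full dot product against every stored vector.
ops = 0
best_idx, best_score = -1, float("-inf")
for i, vec in enumerate(stored):
    score = sum(q * v for q, v in zip(query, vec))
    ops += DIM  # one multiply-add per dimension
    if score > best_score:
        best_idx, best_score = i, score

print(ops)  # 192000 multiply-adds for a single query
```

And that's with only 500 small documents; the cost grows linearly with every document you add, which is exactly the scaling problem vector databases solve.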
This is like searching through a phone book page by page. It works for a small phone book, but imagine trying to find a specific number in a phone book with millions of entries. You'd be there all
day. That's where vector databases come in. Think of them as having a smart librarian who knows exactly where to look. Vector databases can retrieve relevant results instantly, they use resources efficiently, and they're scalable, and they do that by using smart indexing algorithms. What does indexing mean? Earlier, we saw how we represented documents or sentences as points in a vector space and then compared their similarities.
But when there are thousands of such policies, it's going to be impractical to compare them all. That's where indexing comes in. Instead of checking every single vector, we pre-organize them into neighborhoods. In this case, all the animal policies are grouped together, all the health benefits are grouped together, and all the remote work policies are grouped together. That way, when someone asks about bringing their dog to work, we don't search the entire space; we go directly to the animal policies neighborhood and only search there.
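The neighborhood idea can be sketched in a few lines: pre-group vectors by topic, route the query to the nearest group's centroid, and then scan only inside that group. Real indexes like HNSW or IVF build these groupings automatically; this toy version is hand-labeled with made-up 2-D vectors:

```python
# Toy 2-D "embeddings", hand-grouped into neighborhoods by topic.
neighborhoods = {
    "animal": {"centroid": (1.0, 1.0),
               "docs": {"dog policy": (1.1, 0.9), "pet insurance": (0.9, 1.2)}},
    "remote": {"centroid": (8.0, 8.0),
               "docs": {"home office": (7.9, 8.2), "wfh days": (8.1, 7.8)}},
}

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def search(query):
    # Step 1: pick the nearest neighborhood by comparing centroids only.
    name = min(neighborhoods,
               key=lambda n: squared_distance(query, neighborhoods[n]["centroid"]))
    # Step 2: scan only that neighborhood's documents.
    docs = neighborhoods[name]["docs"]
    return min(docs, key=lambda d: squared_distance(query, docs[d]))

print(search((1.0, 0.8)))  # a query near the "animal" cluster
```

Instead of comparing the query to every document, we compare it to two centroids and then two documents; with thousands of documents, that pruning is where the speedup comes from.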
Let's look at the three most popular indexing algorithms used by vector databases. HNSW, or Hierarchical Navigable Small World, is the most widely used algorithm. It creates a graph structure where each vector is connected to its most similar neighbors. When searching, it starts from an entry point and follows the connections to find the closest matches. It's fast and accurate, which is why most vector databases use it by default. IVF (inverted file index) and LSH (locality-sensitive hashing) are other examples of indexing algorithms. Let's now
look at some of the popular vector DB implementations. Chroma is perfect for learning because it's open source and Python friendly. You can install it on your computer and start experimenting immediately. It's free, which makes it great for students and small projects. Pinecone is a managed service, meaning they handle all the infrastructure for you. You just send your data and queries, and they take care of everything else. It's used by big companies in production, but you pay per use. There are other great options too; Weaviate, with its GraphQL API, is another example. But for learning, I recommend starting with Chroma. So the best approach is to start with Chroma for learning and experimentation, and then move up to Pinecone or similar services for production use cases.
So first we install the required library, the chromadb package. Then we import the chromadb library and connect to the client. We create a collection called policies. Chroma creates the new collection in memory, sets up the default embedding model (the all-MiniLM embedding model), and prepares storage for vectors and metadata. We then add policy documents to the collection using the collection.add method. This converts each text to the 384-dimensional vector we spoke about earlier, saves the vector in the collection, and adds the vector to the HNSW index structure. The document is immediately searchable. To search, we run the collection.query method and pass the query string. Now, let's talk about some important ChromaDB concepts. First,
the default behavior of ChromaDB is that it's not persistent. When you create a client with just chromadb.Client, it stores everything in memory. This means when your program stops, all your data is lost. This is fine for learning and experimentation, but not for production. To make ChromaDB persistent, you need to use PersistentClient instead of Client, and you specify a path where you want to store the database files. This way, your data survives program restarts, and you can build up your vector database over time. You can also change the embedding model that ChromaDB uses. By default, it uses the all-MiniLM model, but you might want to use a different model for better performance or to match the model you used when indexing. You can use OpenAI's embedding models or even create a custom embedding function using any model you want. In this case, we pass in a new parameter called embedding_function that takes OpenAI's embedding function along with the API key.
Let's head over to the labs and gain hands-on experience.
Okay, let's now look at the lab on vector DBs. I'm going to start the lab now. For this one, I'm just going to go through a high-level overview and leave you to do most of it, but I'll explain how the lab works. In this lab, we're going to learn how to scale semantic search with vector databases. So let's get that going. The first task is to simply understand the concepts. Before we start building, let's understand what vector databases are. We already discussed that in the video, but here's a quick description of what they are and what they can help us do. And there's a question on the primary advantage of using a vector database over storing embeddings in memory. I'll let you answer that yourself.
The next step is to navigate to the project directory, which is right here. Then we again activate the virtual environment and install the embedding model package, sentence-transformers, which we also did in the last lab. The next step is to install the vector database. In this case, we're going to use ChromaDB, so the task is to install the chromadb package.
Again, I'll just skip through that for now. The next task is to initialize a ChromaDB vector database. If you go here, there's a script called init vector DB. Looking into the script, we first import the chromadb package, and we also have sentence-transformers. We then create the Chroma client using the chromadb.Client method, and we create a collection; we'll call it techcorp docs. Then we load the embedding model, the all-MiniLM-L6 model, and test it with a sample document. We have a test doc, which is really just the sentence given here. We then add the test document to the collection using the collection.add method, print the results, and print the count of documents within the collection. And that's basically it; a quick, beginner-level script.
In the next one, there are a couple of questions that you can answer based on the results of the script. The next task is called store documents. This is where we store actual documents in the ChromaDB database. Again, this is another script that starts off loading the model and client as we did before, but in this case we're reading the TechCorp docs using a helper method from the utilities module. That's what loads all the documents that are in the TechCorp docs folder. So now we're loading actual documents, and then we follow the same approach of adding those documents to the collection and verifying the collection. It's just another layer on top of the basic script; in this case, we're storing real documents. We'll continue to the next task. This is where we perform a vector search against the documents. The script this time is vector search demo, so click on the vector search demo script. Here we have some sample documents, which are sentences, and then there's a query. Let's now
understand chunking. Now that we understand how vector databases work, we have a new challenge. We've been working with simple sentences like "dogs are allowed in the office on Fridays". But what happens when we have real policy documents? What if we have a 50-page employee handbook that we want to add to our vector database? Let's think about this practically. We have an employee handbook: 50 pages of policy content, multiple sections per page, and complex policies with detailed explanations. What happens when we try to add this entire document to ChromaDB as a single entry? Well, technically it would work. ChromaDB would create an embedding for the entire document, but when someone asks what's the remote work policy, they'd get back the entire 50-page handbook. That's not very helpful.
This is what I call the precision problem. Without chunking, when someone asks what's the remote work policy, they get the entire 50-page handbook. The user gets overwhelmed with irrelevant information and has to search through everything to find what they actually need. But with chunking, we break that handbook into smaller, focused pieces. Now, when someone asks about remote work, they get back just the specific policy sections that are relevant. The user gets exactly what they asked for: clear, focused answers. Now, how do we actually break documents into chunks? There are several strategies, but we'll focus on some of the simplest ones. With fixed-size chunks, we simply take, say, 500 characters per chunk. This is simple and reliable for most use cases. We just split the document into equal-sized pieces, which makes it easy to understand and implement.
But there's a problem with this approach. What happens when we split right in the middle of a sentence? We might end up with "dogs are allowed" in one chunk and "on Fridays" in the other. This breaks the meaning and makes it hard for the system to understand the complete information. That's where overlap comes in. We add a 50-character overlap between the chunks, so the end of one chunk overlaps with the beginning of the next. This way, if we do split a sentence, the important context is preserved in both chunks.
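A minimal sketch of fixed-size chunking with overlap (the idea, not the lab's exact code): each chunk starts 150 characters after the previous one, so consecutive 200-character chunks share a 50-character overlap:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks overlap."""
    step = chunk_size - overlap  # advance less than chunk_size to create overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reached the end of the document
    return chunks

doc = "Dogs are allowed in the office on Fridays. " * 20  # 860 characters
chunks = chunk_text(doc)
print(len(chunks))
# The overlap: the last 50 characters of one chunk repeat at the start of the next.
print(chunks[0][-50:] == chunks[1][:50])  # True
```

Because the tail of each chunk is repeated at the head of the next, a sentence split at a chunk boundary still appears whole in at least one of the two chunks.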
Now, there are other methods of chunking, like sentence-based chunking, where every sentence becomes a separate chunk, or paragraph-based chunking, where each paragraph becomes a single chunk. Chunking might sound simple, but it's actually quite tricky. The main challenge is finding the right balance. If chunks are too small, we lose context. As we saw earlier, if one chunk has "dogs are allowed" and the other chunk has "on Fridays", the user would get incomplete information. We'd have poor understanding because we're missing important details, and the information would be fragmented. On the other hand, if chunks are too large, we have poor precision. If we put an entire policy in one chunk, we're back to the same problem we started with. The search would be inefficient because there's too much irrelevant content, and the results would be overwhelming. So it's important to choose the right strategy based on your requirements. Apart from fixed-size chunking, there are the sentence-based and paragraph-based methods we saw, but also others like semantic chunking and agentic chunking, which are out of scope for this video. Now let's build a simple chunking function. This function takes a document and splits it into overlapping chunks. The key features are that it tries to break at sentence boundaries when possible, it maintains the overlap for context, and it handles the end of the document properly. This is simple chunking done by a Python library. Now
let's see how chunking integrates with our vector database. The complete workflow is that we chunk our large policy document, add each chunk to the vector database with a unique ID, and then, when we query, we get back the specific chunks that are most relevant. This gives us the best of both worlds: we can handle large documents, but we get precise, relevant answers. Instead of searching through entire documents, we are searching through focused chunks that contain exactly what the user is looking for. Let me share some key
principles for effective chunking. For size guidelines, 200 to 500 characters is a good balance of context and precision, with 50 to 100 characters of overlap to maintain continuity. You might need to adjust based on your content; technical documents might need different chunk sizes than general policies. For boundary rules, always try to split at sentences to maintain grammatical integrity, avoid mid-word breaks to keep words intact, and preserve paragraphs to maintain logical structure. Finally, always test with real queries to ensure your chunks actually answer questions. Verify that the overlap preserves meaning, and monitor your search results to see if you need to adjust the chunk size. Remember, chunking is all about finding the right balance between context and precision. It's not just about breaking documents into pieces; it's about breaking them in a way that makes sense for your users.
All right, let's look into the next lab, on document chunking. In this lab, we're going to look at chunking techniques. We'll learn how to optimize RAG performance by breaking documents into focused, searchable chunks. First, we activate the virtual environment; this is something we have already done many times.
All right, first we're going to look at the chunking problem demo script. If you expand the rag project, there should be a script called chunking problem demo. This script demonstrates the core problem of searching large documents in RAG systems. It creates a sample employee handbook and shows how searching for specific information, like internet speed requirements, returns the entire document instead of just the relevant section. We'll see a large document stored as a single chunk, search queries that should find specific sections, and results that return the entire document. Here you can see there's a sample document that has multiple sections; we're adding that document to the Chroma collection and then querying for internet speed requirements. So let's run the script and see how it works. As you can see, it returns the entire document. It's truncated here, but the result shows the whole thing. That's the problem with this approach, so the answer to this task is: large documents return irrelevant results. Next, we will look at some of the libraries and dependencies that we'll be
using. First, we have what is known as LangChain. If you don't know what LangChain is, we have other videos on our platform, and we have a future course coming up that will cover LangChain end to end, so do remember to subscribe to our channel to be notified when it comes out. LangChain is a powerful framework for building RAG applications. It provides RecursiveCharacterTextSplitter for smart document chunking. There's also spaCy, an advanced natural language processing library, which provides a spaCy text splitter for sentence-aware chunking. We'll use spaCy for sentence-aware chunking, and these libraries take care of chunk sizes, overlaps, separators, etc. We'll install the LangChain and spaCy dependencies. Okay, we'll go to the next question, where we'll first look at basic chunking.
If you open the basic chunking script, you'll see that it uses the LangChain text splitters module, from which we get the RecursiveCharacterTextSplitter class. Here we have a sample document, and this is where we do the splitting. As you can see, we specify a chunk size of 200 and a chunk overlap of 50; that's the 50 characters of overlap between chunks, along with the separators that are defined. We then call splitter.split_text to split the text into chunks, and we go through the chunks and print them. I'll let you do that yourself. We'll go to the next one, where there's a bunch of questions that you have to answer by reading and understanding the script. I'll let you do that by yourself.
The next one we'll look at is sentence chunking. If you look at the script, we're using spaCy as the library, and then there's a question based on the output of that script. Finally, we look at chunked search. This is another script, a chunked vector search demo, that connects everything we have learned so far. First we chunk the documents, then we add these chunked documents to a collection, and there's a comparison between a collection with no chunking and a collection with chunking, so we'll see the difference between the two. Again, I'll let you go through that by yourself, and there's a question based on it. So, yep, that's a quick lab on chunking, and I'll see you in the video. Let's now bring it all
together to build our RAG system. Now that we understand all the individual components of RAG, that's retrieval, augmentation, and generation, it's time to see how they all work together in a real system. We've been building our policy copilot system piece by piece. But what does it look like when everything is connected and running in production? We know the basic flow: a user query goes to retrieval, then augmentation, then generation, and finally the response. But this is just the high-level view. In a real system, there are many more components working behind the scenes to make this happen smoothly, efficiently, and reliably. Now,
everything we spoke about so far, such as chunking, creating embeddings, and storing them in a vector DB, are things that need to be done before the user starts asking questions, because loading thousands of documents, chunking them, creating embeddings, and storing them in the DB all takes a lot of time. These steps go together in a stage called the RAG pipeline. Let's take a closer look at a simple RAG pipeline. The pipeline gets the policy documents, chunks them into small pieces using a chunk size of 500 with an overlap of 50 characters, then converts them into embeddings using OpenAI's embedding models, and finally loads them into a vector DB. Now, when a query comes in, we search the vector DB and it gives us the most relevant chunks. We then augment the user's query with those chunks and send that to the LLM to generate a response. So that's a super simplistic RAG pipeline. Let's
head over to the labs and see this in action. All right, so this is the last lab in this course, and it's about building a complete RAG pipeline. We'll learn how document chunking integrates with vector search, how query processing connects to retrieval, how context augmentation feeds into response generation, and how the complete RAG pipeline works end to end. This basically combines everything from the first four labs that we've just done.
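As a rough mental model of what the lab's script does, here's a toy end-to-end pipeline with every expensive component stubbed out. The word-overlap "embedding" and the echo "LLM" are illustrative stand-ins, not the lab's actual code:

```python
# Toy RAG pipeline: chunk -> "embed" -> store -> retrieve -> augment -> "generate".
def chunk(doc: str, size: int = 60) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def embed(text: str) -> set[str]:
    # Stand-in for a real embedding: the set of lowercase words.
    return set(text.lower().split())

def retrieve(store: list[tuple[set, str]], query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda item: len(item[0] & q), reverse=True)
    return [text for _, text in ranked[:k]]

def generate(prompt: str) -> str:
    # Stand-in for an LLM call: just echo the prompt.
    return f"ANSWER BASED ON:\n{prompt}"

# Indexing stage (done once, before users ask questions).
policy = "Remote work is allowed three days per week. Office dogs are welcome on Fridays."
store = [(embed(c), c) for c in chunk(policy)]

# Query stage: retrieve, augment, generate.
query = "how many remote days per week"
context = "\n".join(retrieve(store, query))
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(generate(prompt))
```

In the real lab, embed is a sentence-transformers model, the store is ChromaDB, and generate is an LLM API call, but the shape of the data flowing between the stages is the same.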
All right, first we start with setting up the virtual environment. The environment is already set up; you just need to activate it.
All right, we start by looking at the complete RAG demo script. We have a single script now that combines everything we've done so far, and we'll look at it section by section. The first section has the document loading and chunking, with a function for that. We have some sample documents, a text splitter, and all the chunks that are created here. Then we have section two, which is the vector database setup. Here you can see we set up a ChromaDB vector database and store the document chunks there. Then we have the user query processing section, where we actually process the user queries and do the actual search. Next, we have the context augmentation; this is where we build the augmented prompt with retrieved context for the LLM. Here you can see how a prompt is generated with the context in place, which is basically the policies that were retrieved, then the user's actual question, and some additional prompt engineering. Then we have the generate response function that generates a response using the LLM, and finally the complete RAG pipeline function that calls each of the functions we have written before, followed by the main function. Well, I'll let you explore this lab by yourself. There's a lot of interesting questions and challenges throughout.
This section covers the essential production concerns: caching to make systems fast, monitoring to know what's happening, and error handling to keep systems running when things go wrong. Let's start with a fundamental problem: RAG systems are slow. Every query involves multiple expensive operations: generating embeddings, searching vector databases, calling LLM APIs. Without optimization, a single query can take nearly a second. But here's the thing: most queries are repeated or are very similar. People ask the same questions over and over. "What's the reimbursement policy for home office setup?" gets asked dozens of times. Caching solves this by storing the results of expensive operations and reusing them. Instead of taking 950 milliseconds, a cached response might take just 5 milliseconds. That's 190 times faster. The key insight is that we don't need to recompute everything for every query.
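The idea can be sketched with a small in-memory TTL cache wrapped around the answering function. In production this role is typically played by Redis; the class and function names below are illustrative:

```python
import time

class TTLCache:
    """Tiny in-memory cache where entries expire after ttl seconds."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self.data = {}  # key -> (value, expiry_timestamp)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self.data[key]  # stale entry: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self.data[key] = (value, time.monotonic() + self.ttl)

# One cache per level; final answers might use a longer TTL than search results.
answer_cache = TTLCache(ttl=3600)

def answer(query: str) -> str:
    cached = answer_cache.get(query)
    if cached is not None:
        return cached  # cache hit: skip embedding, search, and the LLM call
    result = f"generated answer for: {query}"  # stand-in for the full pipeline
    answer_cache.set(query, result)
    return result

answer("what's the remote work policy")         # miss: computed and stored
print(answer("what's the remote work policy"))  # hit: returned instantly
```

The second call never touches the expensive pipeline, which is where the 190× speedup for repeated questions comes from.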
We can cache at multiple levels: the embeddings, the search results, or even the final answers. There are four main types of caching that we can implement in RAG systems, each solving a different performance bottleneck. A query cache is the simplest: we store complete question-answer pairs. When someone asks what's the remote work policy again, we return the exact same answer instantly. This works great for frequently asked questions. An embedding cache stores the computed vectors for text. This is useful because generating embeddings is expensive, and we often process the same text multiple times, like policy chunks that appear in multiple searches. A vector search cache stores the results of database queries. This helps when similar queries return the same results; "remote work" and "working from home" might return identical chunks. An LLM response cache stores the generated answers. This is the most expensive operation to cache, but also the most valuable, since LLM calls are typically the slowest part of the pipeline. The key is to cache at the right level, not too granular, not too broad, and with appropriate expiration times. Let's look at how to actually
implement caching. Redis is a popular caching tool because it's fast, supports different data types, and has built-in expiration. The example shows a simple but effective caching strategy. We create a unique cache key by hashing the query and context together. This ensures that different queries get different cache entries, while identical queries share the same entry. We check the cache first: if we find a cached response, we return it immediately. If not, we generate the response using our normal RAG pipeline, then store it in the cache with an expiration time. The TTL, or time to live, is crucial. We want to cache long enough to get performance benefits, but not so long that the data becomes stale. For policy documents, a longer TTL might be appropriate; for more dynamic content, we might use shorter times. You
You can't manage what you don't measure. In production, we need to monitor everything to understand how our RAG system is performing and when problems occur. The basic metrics are response time, how fast we answer questions; throughput, how many queries we handle per second; and error rate, what percentage of requests fail. But RAG systems have their own specific metrics we need to track. Retrieval quality measures how relevant the returned chunks are to the user's question. Embedding performance tracks how long it takes to generate vectors. Chunking efficiency monitors how well we're breaking up documents. We set alerting thresholds to know immediately when something goes wrong. If response time exceeds 2 seconds, there's a performance issue. If error rate goes above 5%, there's a system problem. The key is to set realistic thresholds based on actual performance, not theoretical targets.
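As a rough sketch of these ideas (the names, the snapshot shape, and the thresholds are illustrative assumptions, not the course's code), a tiny in-process collector can time pipeline stages and flag threshold breaches; in production, Prometheus would do this collection instead:

```python
import time
from contextlib import contextmanager

# Illustrative thresholds from the discussion: 2 s latency, 5% error rate
THRESHOLDS = {"response_time_s": 2.0, "error_rate": 0.05}

class Metrics:
    """Tiny in-process stand-in for a real metrics system like Prometheus."""
    def __init__(self):
        self.timings = {}    # stage name -> list of durations in seconds
        self.requests = 0
        self.errors = 0

    @contextmanager
    def timed(self, stage):
        # Times one pipeline stage, e.g. "embedding" or "retrieval"
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings.setdefault(stage, []).append(time.perf_counter() - start)

    def snapshot(self):
        # Very simplified: total time across all recorded stages
        total = sum(d for durations in self.timings.values() for d in durations)
        return {
            "response_time_s": total,
            "error_rate": self.errors / self.requests if self.requests else 0.0,
        }

def check_alerts(metrics_snapshot):
    """Return the names of metrics that breach their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics_snapshot.get(name, 0.0) > limit]
```

For example, `check_alerts({"response_time_s": 2.5, "error_rate": 0.01})` flags only the latency breach.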
So we want alerts that indicate real problems, not false alarms that cause alert fatigue. Now, things will go wrong in production. Vector databases will go down. LLM services will be unavailable. Networks will have timeouts, and we need to handle these failures gracefully. The goal is graceful degradation: the system should still work even if not at full capacity, so users should get some answer rather than just an error message. The example shows a cascading fallback strategy. If the full RAG pipeline fails, we try keyword search. If that fails, we return the retrieved chunks directly. If even that fails, we use simple text matching. And as a last resort, we return a helpful error message. We also periodically test whether a failed service is back by sending a few requests. This is the half-open state of a circuit breaker: if those probe requests succeed, we close the circuit and resume normal operation.
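The cascading fallback and the circuit breaker's half-open probing might be sketched like this (the strategy names, messages, and timings are illustrative assumptions, not the course's actual code):

```python
import time

def answer_with_fallbacks(query, strategies):
    """Try each answering strategy in order; degrade gracefully."""
    for strategy in strategies:
        try:
            result = strategy(query)
            if result:
                return result
        except Exception:
            continue  # fall through to the next, simpler strategy
    # Last resort: a helpful message instead of a raw error
    return "Sorry, I can't answer right now. Please try again shortly."

class CircuitBreaker:
    """Opens after repeated failures; goes half-open after a cooldown to probe recovery."""
    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (normal operation)

    def allow_request(self):
        if self.opened_at is None:
            return True  # closed: requests flow normally
        # Half-open: allow a probe request once the cooldown has elapsed
        return time.time() - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()  # open the circuit
```

In a real system, each strategy in the list would wrap its own service call (vector DB, keyword index, and so on) behind its own breaker.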
Let's now bring it all together to build our RAG system. Now that we understand the core RAG architecture, we need to talk about what happens when we put these systems into production. Real-world RAG systems face challenges that don't exist in our simple examples: performance issues, failures, and the need to handle thousands of users. This diagram shows a complete production RAG system running on Kubernetes, and let me walk you through each layer. We have a data layer, a RAG pipeline layer, an application layer, and a monitoring stack. The data layer includes all our storage systems: ChromaDB for vectors, Redis for caching, PostgreSQL for metadata. The RAG pipeline layer contains the core RAG functionality broken down into microservices: query processing, chunking, embedding generation, retrieval, augmentation, and generation. Each service can scale independently based on demand. The application layer contains all the user-facing services: the web UI, the mobile app back end if there is one, the admin interface, and so on. These services handle user interactions and present the RAG capabilities through different interfaces. And then we have our complete monitoring stack: Prometheus for metrics, Grafana for dashboards, Jaeger for tracing, and the ELK stack for logging. This layered architecture separates concerns clearly. Applications handle user interactions, the RAG pipeline processes the core functionality, and the data layer provides storage. This can handle thousands of concurrent users while maintaining high availability and performance.
Well, that's a high-level overview. We haven't spoken about a lot of advanced topics like multimodal RAG, graph RAG, hybrid search techniques, federated RAG, reranking techniques, query expansion, and context compression. To learn more about AI and other related technologies, check out our AI learning path on KodeKloud. Well, thank you so much for watching. Do subscribe to our channel for more videos like this. Until next time, goodbye.