
20 AI Concepts Explained in 40 Minutes

By Gaurav Sen

Summary

Topics Covered

  • Attention Resolves Word Ambiguity
  • Self-Supervised Scales Without Labels
  • Transformers Stack Contextual Layers
  • Context Engineering Evolves Prompts
  • RLHF Reinforces Without Mental Models

Full Transcript

Hi everyone, this is GKCS.

In today's video we will see some of the commonly used terms in the AI space.

If you are an engineer who is building applications, then you will find these terms useful.

When communicating with people within your team or outside.

And I think if you know these terms, then it is also easier to learn the deeper subjects around AI.

So by the end of this video, you'll have a list of terms whose definitions you understand quite well.

And I'll also be linking some references in the description so that you can dig into them further.

Let's start.

The first term that you should know about is large language model.

Also known as an LLM.

And the definition of this is a neural network.

That is trained to predict the next token of an input sequence.

For example, if I pass in the query "all that glitters" to a large language model, then it's going to come up with the response of "is not gold", okay.

At which point the complete response of all that glitters is not gold is returned to the user.
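
To make this concrete, here is a minimal sketch in Python, with a made-up lookup table standing in for the neural network: generation is just a loop that predicts a token, appends it, and repeats.

```python
# Minimal sketch of next-token generation. A made-up lookup table stands in
# for the neural network; a real LLM scores every token in its vocabulary.
TOY_MODEL = {
    ("all", "that", "glitters"): "is",
    ("that", "glitters", "is"): "not",
    ("glitters", "is", "not"): "gold",
    ("is", "not", "gold"): "<end>",
}

def generate(tokens, max_new_tokens=10, end_token="<end>"):
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        next_token = TOY_MODEL.get(tuple(tokens[-3:]), end_token)  # "predict" the next token
        if next_token == end_token:
            break
        tokens.append(next_token)  # feed the prediction back in and repeat
    return tokens

print(generate(["all", "that", "glitters"]))  # ['all', 'that', 'glitters', 'is', 'not', 'gold']
```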

What do we mean by training?

What do we mean by neural network?

As we go through this video, you will be understanding these terms better one by one.

Okay.

The second term that we're looking at is tokenization.

This has to do with processing the input of a large language model.

For example, if all that glitters is passed into a large language model, the first thing it's going to do is break this into discrete tokens.

That is the process of tokenization.

The first token will be all.

Then there's a space character.

You then have "that", after which you have "glitt" and finally "ers".

Now you might think, well, why shouldn't you just break this on space characters and get the job done?

But humans do not talk like that.

We are, after all, trying to process natural language.

So "ers" is a common suffix.

Shimmers. Murmurs. Flickers.

These are terms which have the suffix "ers", which means that the action of glittering, shimmering or flickering is being performed by that object.

Another example of this is "ing".

So eating, dancing, singing all have the suffix "ing", and a large language model can look at this token of "ing" and know that the preceding action is being performed, as long as it sees the suffix.

Okay, remember, the core problem for the large language model is to truly understand human language so that it can speak it really well.

Tokenization is an essential part of that.

Its end result is that the input text is broken into tokens.
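
As a rough illustration, here is a toy subword tokenizer using greedy longest-match against a small, invented vocabulary; real tokenizers such as BPE learn their vocabulary from data, so treat this only as a sketch of why "glitters" can split into "glitt" plus "ers".

```python
# Toy subword tokenizer: greedy longest-match against a small, made-up vocabulary.
# Real LLM tokenizers (BPE, WordPiece, etc.) learn their vocabulary from data;
# this only illustrates why "glitters" can split into "glitt" + "ers".
VOCAB = {"all", " ", "that", "glitt", "ers", "ing", "eat", "danc", "sing"}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # try the longest substring starting at i that is in the vocabulary
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("all that glitters"))  # ['all', ' ', 'that', ' ', 'glitt', 'ers']
```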

Which brings us to our third term: vectors.

Tokens tell you what you should focus on.

What is the smallest term that you can derive meaning from?

But what meaning has to be derived is represented by vectors.

If the large language model can map words into a two-dimensional or an n-dimensional space, such that all the words which are close in meaning are placed close to each other, then the benefit will be that the meaning of each word is turned into a coordinate in this n-dimensional space.

This is called a vector.

Okay.

The coordinate, the mapping of a word in an n-dimensional space such that similar-meaning words are all clustered together and opposite-meaning words are somewhere far away, comes through the process of vectorization.

The end result of this is that large language models know the inherent meaning of all the words in the English vocabulary, and they also know how to break any input text into small tokens.

Words which are similar to each other are placed close to each other.

Once they know the meaning, they can construct sentences effectively.
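
A hedged sketch of the idea, with made-up two-dimensional coordinates (real embeddings have hundreds or thousands of dimensions): similar words get nearby coordinates, and cosine similarity measures that closeness.

```python
import math

# Made-up 2-D coordinates purely for illustration; real models use hundreds
# or thousands of dimensions learned during training.
embedding = {
    "apple":   (0.9, 0.8),
    "banana":  (0.85, 0.9),
    "guava":   (0.8, 0.85),
    "revenue": (-0.7, 0.6),
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Words with similar meaning end up close together (similarity near 1).
print(cosine_similarity(embedding["apple"], embedding["banana"]))   # high
print(cosine_similarity(embedding["apple"], embedding["revenue"]))  # lower
```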

Okay, so now you have large language models which can tokenize input text, convert them into vectors.

But there was one major concept which actually changed the entire industry here, which made large language models very popular.

And that is attention.

We just said that all the input tokens for a large language model are converted into vectors.

The vectors encapsulate the meaning of those words.

But what about the word apple when you say it is a tasty apple, you mean the fruit, the edible apple?

When you say apples revenue, you probably mean the company.

And if you say the apple of my eye, you are probably talking about a young person who you have affection for.

So Apple has different meanings, and the only way to understand the meaning is not by looking at the word itself, because that spelling is the exact same, but by looking at nearby words which add context to the meaning of apple.

The moment I said tasty, you know that it's some sort of food that I am going to talk about.

That's how humans derive meaning, and large language models can derive meaning this way.

Now, the way they do this is to look at nearby words in a sentence and generate those vectors, so nearby contextual vectors are picked up.

For ambiguous terms you end up with ambiguous vectors, but you can derive the exact meaning by adding these nearby contextual vectors to them.

So take the vector of "apple" and take the vector of "revenue".

When you combine these two vectors, when you perform some sort of operation, not a direct addition but the attention operation, you effectively take the vector of "apple" and push it in the direction of the company Apple.

So Google, Meta, Microsoft are all here.

The attention operation with the vector of "revenue" is going to send it there.

If you instead take the vector of "tasty" and perform the attention mechanism on these two vectors, then it's going to push the vector of "apple" towards banana, chiku and guava.
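
Here is a minimal single-head scaled dot-product attention sketch on made-up two-dimensional vectors, just to show the mechanics of "apple" being nudged by a context word; real models use learned query, key and value projections and many heads.

```python
import numpy as np

def attention(query, keys, values):
    """Single-head scaled dot-product attention over a tiny toy example."""
    scores = query @ keys.T / np.sqrt(query.shape[-1])    # how relevant each word is
    weights = np.exp(scores) / np.exp(scores).sum()       # softmax over the words
    return weights @ values                               # weighted blend of meanings

# Made-up 2-D vectors: "apple" is ambiguous, "revenue" points toward companies.
apple   = np.array([0.5, 0.5])
revenue = np.array([-0.8, 0.9])

tokens = np.stack([apple, revenue])
contextual_apple = attention(apple, tokens, tokens)
print(contextual_apple)  # apple's vector nudged toward the "company" region
```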

Okay, so you can tokenize input text.

You can derive the inherent meaning of all of those tokens.

And for ambiguous tokens, for tokens which are difficult to understand.

You have a mechanism to add context by looking at nearby words.

And this is another breakthrough that large language models have made.

This was in 2017.

The paper came out then, but in 2022 this became really, really famous with ChatGPT being released.

The quality of responses of a large language model far exceeded anything else that we had seen earlier.

Okay, because it is able to derive contextual meaning, it's able to construct sentences in a way that humans speak.

Okay, so now we know how LLMs can process input.

But how do you train them to predict the next token?

Okay, here's where there was a major breakthrough in 2017.

Basically the concept of self-supervised learning.

Became very popular.

Self-supervised learning means that instead of telling the model exactly what it needs to do, the structure of the input data is such that the model knows what it should do.

Okay.

For example, you're watching this video right now.

I'm going to make a part of this video blank.

So: 5, 4, 3, 2...

What do you think is being hidden right now?

What number is coming to your mind?

Let's see if that is right.

Yes, most of you guessed one because we went in the sequence five four three two one.

Okay.

But when it comes to a video, you can also do something else.

Let me make another part of the video blank right now.

Where do you think the other eye is looking?

Let's check.

Most of you got it right.

Both eyes are looking upwards.

So what's happening is a section of the input can be predicted even if you make that section blank, which means that there is inherent structure in your input which your mind is able to fill in with the expected token or expected output.

Now, the standard way to train such a model would be called supervised learning, where you would have a human being say that if the input text is all that glitters, then the model should predict is not gold.

If the input text is "Et tu", then the output should be "Brutus" instead.

Self-supervised learning has made getting training data much cheaper here.

If you have "Et tu, Brutus", then the model is going to be fed this text and it's going to make three predictions.

One: what comes after "Et"?

Two: what comes after "Et tu"? And three: what comes after "Et tu, Brutus"?

Okay, no humans are involved.

You had some text in the world.

Maybe you scraped this off the internet, and now you're telling the model:

Look, I have three questions for you.

Tell me, what are the right answers?

So the model looks at these three puzzles.

They are all running in parallel, and they try to make predictions.

So for the first puzzle the model might say something, but you train the model that "tu" is the expected response.

So if it makes a mistake then you penalize the model that increases loss.

And so the neural network weights are updated.

In the second task you have "Et tu"; if the model makes the prediction of "Brutus", then you tell the model that this is great.

The weights don't need to be updated.

But if it says Caesar, then the model has to be penalized.

And so the internal weights are updated.

In the third case, if you predict a stop token, as in "Et tu, Brutus" and that's it, then you will get it wrong.

If it is a comma, you're right.

And if it's a question mark, then maybe you're also right.

Okay.

What you're doing is you are looking at text, which already exists in the world, and you're creating multiple challenges for yourself without human intervention.

This is what makes the model self-supervised.
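
A tiny sketch of how raw text turns into training examples with no human labels: every prefix of the sentence becomes an input, and the token that follows it becomes the target.

```python
# From a single scraped sentence, create one training example per position:
# the model sees the prefix and must predict the token that follows it.
def make_training_examples(tokens):
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for prefix, target in make_training_examples(["all", "that", "glitters", "is"]):
    print(prefix, "->", target)
# ['all'] -> that
# ['all', 'that'] -> glitters
# ['all', 'that', 'glitters'] -> is
```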

It might seem like a small thing, but this architectural decision or this benefit of the large language model makes it really, really scalable.

In fact, most AI models now are moving to self-supervised learning.

Even image models, like we discussed, are removing some patches of the image and trying to predict those patches.

The benefit of this is you understand the underlying structure and the inherent meaning of those patches.

In the case of text, it's going to be terms. In the case of images, they are a bunch of pixels.

And in the case of video you might understand how an object even moves.

Okay.

So that explains what self-supervised learning is.

Next is the transformer okay.

And most people confuse transformer with large language model, which is completely understandable actually.

But that's not the case.

A large language model is something which predicts the next token given an input sequence.

A transformer does the exact same thing, but it's a specific algorithm or a specific method by which you predict the next token.

A transformer basically is input tokens being run through an attention block, which is then forwarded to a neural network, a feedforward neural network, and then you have a bunch of outputs.

Okay, you can think of these as output vectors.

These vectors are then passed in to another layer of attention.

The first layer of attention, like we said, disambiguates terms. The second layer might find more complex relationships.

It might find sarcasm. It might find implications.

For example, a crane was hunting a crab.

So in the first case you understood it is not the metal crane, it's the bird crane.

But in the second one you might infer that the crab is fearful.

You might understand the crane is hungry.

So this is the second layer.

And then you have another feedforward neural network and so on.

Till finally you are confident enough to generate an output.

Okay. So you have these stacked.

Sometimes they're stacked to 12 layers, sometimes more.

I think recent GPT architectures are in hundreds.

The main idea behind this is to get all of the meaning from your input tokens and then manipulate them again and again to finally predict what the next word should be.

This attention block is order n squared, O(n²).
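
Below is a stripped-down NumPy sketch of the stacked structure described above, attention followed by a feed-forward layer, repeated; it omits layer norm, multiple heads and learned projections, so it is only an illustration, not a real transformer implementation.

```python
import numpy as np

def self_attention(x):
    # x: (sequence_length, dim). Every token attends to every other token,
    # which is why this step costs O(n^2) in the sequence length n.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def feed_forward(x, w1, w2):
    return np.maximum(0, x @ w1) @ w2  # simple two-layer MLP with ReLU

def transformer_stack(x, layers):
    # Each layer: attention (mix context between tokens) then feed-forward
    # (transform each token independently). Real models stack 12+ such layers.
    for w1, w2 in layers:
        x = x + self_attention(x)        # residual connection, simplified
        x = x + feed_forward(x, w1, w2)
    return x

dim = 4
layers = [(np.random.randn(dim, 8), np.random.randn(8, dim)) for _ in range(2)]
tokens = np.random.randn(3, dim)                # 3 input tokens, 4-dimensional each
print(transformer_stack(tokens, layers).shape)  # (3, 4)
```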

Okay.

You could replace the transformer in a large language model with something else.

A new architecture could come in, in which case the transformer is gotten rid of; it could be a state space model, or a diffusion model that constructs essays or text.

Okay, so the large language model is actually the product.

You can think of it as a car.

And this is the engine.

A car, many people say is just the engine.

But no, there are some other fancy things around it.

The internal algorithm can be different.

Term number seven is fine-tuning.

We said that a large language model is something that is trained to predict the next token of an input sequence.

The question is what type of next token are we talking about?

If you are talking about a medical large language model, something which helps doctors explain the diagnosis of a patient, then you're probably going to be thinking of medical terms. If you have a model which is trained on financial operations.

Then the same model for the same query is going to think in terms of financial terms. So the next token that the model comes up with is not always going to be general.

You're first going to train your base model.

In a self-supervised fashion.

Then you're going to take that model and make it go through a series of questions and answers.

This process is called fine tuning.

And it goes something like: who is the president of the USA?

Donald Trump.

But the model could also say, I would like to know that too.

Here's where things can go wrong, okay.

The model should not be responding like this.

Give us a direct answer or confess that you do not know.

Or it could just say no.

But then this is also very, very bad because the models are trained to be helpful.

Okay, so what's happening is other plausible responses which are not wrong but are not desirable, are penalized in the fine tuning process.

You have these questions and answers.

The fine-tuning process forces the model to take a question and give answers as expected.

So when it comes to a medical diagnosis, the model is going to train itself.

The internal weights will be updated in such a way that it will learn to speak in medical jargon or medical terms. And so this step, where a base model is trained to answer in a specific way, is called fine tuning.

The same base model can be run through different sets of questions answers to come up with multiple fine tuned models.

So the base model of Llama can be fine-tuned by a company to answer its customers' specific queries.
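
As a toy stand-in for this feedback loop (not real gradient-based fine-tuning, which updates neural network weights), the sketch below keeps a preference score per candidate answer and nudges it toward the desired answer.

```python
# Toy stand-in for fine-tuning: the "model" keeps a preference score for each
# candidate answer per question, and training nudges the scores toward the
# desired answer. Real fine-tuning updates neural-network weights by gradient
# descent; this only illustrates the feedback loop on question/answer pairs.
qa_pairs = {
    "Who is the president of the USA?": "Donald Trump",
    "Where is my parcel?": "Let me check your order status.",
}
candidates = {
    "Who is the president of the USA?": ["I would like to know that too.", "Donald Trump"],
    "Where is my parcel?": ["No.", "Let me check your order status."],
}
scores = {(q, a): 0.0 for q, answers in candidates.items() for a in answers}

def answer(question):
    # Pick the candidate answer with the highest score so far.
    return max(candidates[question], key=lambda a: scores[(question, a)])

for _ in range(10):  # the "fine-tuning" loop
    for question, desired in qa_pairs.items():
        predicted = answer(question)
        if predicted != desired:  # penalize undesirable answers, reward desired ones
            scores[(question, predicted)] -= 1.0
            scores[(question, desired)] += 1.0

print(answer("Where is my parcel?"))  # "Let me check your order status."
```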

Term number eight is few-shot prompting.

So the main idea behind few-shot prompting is that before you send a query to a model, before you send a plain vanilla query to a large language model and ask it to come up with a response.

You augment the query.

You add more information by saying, look, if the query is where is my parcel?

Then let me tell you that there are some examples that I want you to go through.

This is happening during inference time, during response time, in production.

Right?

Live, your system, your server sends the original query along with examples to the model, so that it takes these into context and then gives an appropriate response.

The quality of the response goes up.

This is called few-shot prompting.

It's basically example prompting: examples in the prompt.

That's it.
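
A small sketch of what the augmented prompt can look like; the example queries and responses are invented.

```python
# Building a few-shot prompt: prepend a handful of example query/response
# pairs (made up here) so the model imitates their format and tone.
examples = [
    ("Where is my parcel?",
     "I'm sorry for the delay. Your parcel is out for delivery and should arrive today."),
    ("I was charged twice.",
     "Apologies for the trouble. I've flagged the duplicate charge for a refund within 3-5 days."),
]

def build_few_shot_prompt(user_query):
    parts = ["You are a helpful customer-support assistant.", ""]
    for query, response in examples:
        parts += [f"Customer: {query}", f"Agent: {response}", ""]
    parts += [f"Customer: {user_query}", "Agent:"]
    return "\n".join(parts)

print(build_few_shot_prompt("My order arrived damaged."))
```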

This brings us to point number nine, which is very interesting and has completely exploded: retrieval augmented generation.

In fact, the AI space is moving so quickly that people are saying RAG, or retrieval augmented generation, is already dead.

So the basic idea again, is that you have a large language model and you pass in the input from the server.

So a customer connects to you here; they hit your API.

The server says: you know what, this is the customer query, let me forward that to the large language model.

Along with that, let's give some examples.

So that's few-shot prompting.

And along with that, since there are some company policies that I want you to know of, large language model, I'll give you those documents.

So in real time the server goes fetches the most relevant documents.

Maybe your policy document, maybe your terms and conditions when placing an order, and maybe many more things.

Right?

You send these documents along with examples of how you should respond.

This gives you a good idea of the format of the response.

This gives you a good idea of the company specific context, and this is the direct user input query.

Okay, with all of this, the large language model tends to give very high quality responses.

Now the question is where are you getting these documents from?

How does the server know which documents are related to which query?

There are many ways to do this.

If you talk to Neo4j, which is a graph database company, they will tell you that you should store things in a graph DB.

If you talk to Neon, then they will tell you that you should store things in a vector DB, and some people will say just keep everything in memory.

Just keep everything in cache.

How you fetch the documents doesn't matter so much.

Usually it's a vector DB, by the way, because it is easier to find relevant documents: you just do a similarity search.

Once you have the documents, you pass that to a large language model.

The large language model converts them internally into vectors and then gives you a response.

Okay, but at a high level you just want to add more and more context.

You retrieve the context, augment the query, and then generate a response.
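
Putting it together, here is a minimal RAG sketch; the "embedding" is just a toy bag-of-words counter and the documents are invented, whereas a production system would use a learned embedding model and a vector database.

```python
# Minimal RAG sketch. The "embedding" here is just a toy bag-of-words vector;
# a real system would use a learned embedding model and a vector database.
import math
import re
from collections import Counter

documents = [
    "Refund policy: refunds are issued within 7 days for failed payments.",
    "Shipping policy: orders ship within 2 business days.",
]

def embed(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norms = (math.sqrt(sum(v * v for v in a.values())) *
             math.sqrt(sum(v * v for v in b.values())))
    return dot / norms if norms else 0.0

def retrieve(query, k=1):
    q = embed(query)
    return sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)[:k]

def build_rag_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Company policy:\n{context}\n\nCustomer query: {query}\nAnswer:"

print(build_rag_prompt("I am upset with your payment system. I expect a refund."))
```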

The tenth term is the vector database.

We just mentioned vector database is something which is used to find relevant documents for an incoming query.

Let's see how that happens.

You have the request.

I am upset with your payment system.

I expect a refund.

There are a lot of terms in this query.

A human being can read this and easily understand what the user is feeling.

They are feeling upset.

I mean, they've already mentioned it, but they are looking for a refund. If you give them a refund, maybe the upset feeling will go away.

What do you do?

Which documents do you search for?

You could search for all documents where the word upset exists, but maybe you do not have it in your company policy.

Maybe nowhere is it mentioned that a user is upset, but you have a document which mentions if the user is giving you a low rating, or if a user drops off.

How do you make the decision that "upset", as a word, is close to "low rating" or "drop off"?

We spoke about vectors.

Vectors can encapsulate semantic meaning, which means documents which store similar words are going to be similar or close in distance.

Remember, vectors are basically coordinates, right?

So the distance between "upset" and documents having "low rating" is going to be low.

You will fetch the documents which mention low rating or drop offs and use them to add context to your large language model.

When you have an incoming query from the user.

You're going to find which document is closest to the query and add that to the large language models context.

So this document will be sent along with the original user query and maybe a system prompt.

Where are you going to store these documents in a vector database, which helps you perform these similarity searches efficiently.

One such algorithm is hierarchical navigable small world (HNSW).

We have spoken about this in detail in the InterviewReady course.

At the end of the day, the vector database is like a black box to you.

You can store documents and you can quickly retrieve them when you need them.

Great.

So you can store internal company documents and information in a vector database to get context for a large language model.
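
For illustration, a brute-force in-memory stand-in for a vector database, with made-up three-dimensional embeddings; real vector databases use indexes like HNSW so queries stay fast at scale.

```python
import numpy as np

class ToyVectorStore:
    """In-memory stand-in for a vector database: brute-force nearest-neighbour.
    Real vector DBs use indexes like HNSW so queries stay fast at scale."""

    def __init__(self):
        self.vectors, self.docs = [], []

    def add(self, vector, doc):
        self.vectors.append(np.asarray(vector, dtype=float))
        self.docs.append(doc)

    def query(self, vector, k=1):
        q = np.asarray(vector, dtype=float)
        sims = [v @ q / (np.linalg.norm(v) * np.linalg.norm(q)) for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]
        return [self.docs[i] for i in top]

# Made-up 3-D embeddings: "upset" sits close to the low-rating / drop-off doc.
store = ToyVectorStore()
store.add([0.9, 0.1, 0.2], "Policy: users giving low ratings or dropping off get a refund review.")
store.add([0.1, 0.8, 0.3], "Policy: orders ship within 2 business days.")
print(store.query([0.85, 0.15, 0.25]))  # returns the low-rating/refund document
```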

But what if the context exists outside your system?

So this challenge was met with model context protocol.

Okay.

As the name suggests, it's a protocol, a way to communicate or transfer context into a model.

I made a detailed video on this; you can check it out.

But the basic idea here is that you have a large language model which, before receiving an incoming query from a user, has a client, an MCP client, a Model Context Protocol client, which forwards the initial user query.

The LLM now makes a decision.

It says that there may be external tools or databases that I want to connect to.

The client gets to know of this and connects with external MCP servers.

In one case, that might be Indigo.

In another case that will be Air India, whose MCP server can give you details around Air India.

So you can think of this as a wrapper for Air India's database.

This as a wrapper for Indigo's database.

As a response, you are going to get flight details.

From each of these airlines.

Once you have the details, you can forward them to the LLM, saying: hey, along with the user query, and along with whatever system prompt and relevant context I could get from my vector database, I'm also adding flight details, real-time information from external servers, which you can now consume to come up with a decision.

Okay.

And the large language model at this point might say: okay, book flight number 6E 1020 on IndiGo, which then results in another API call to book on the MCP server of IndiGo.

Okay.

The final response is given to the MCP client.

The client then forwards it back to the user.

Resulting in customer satisfaction.

Okay.

You see that the user is no longer just able to look data up.

They do not have to do things themselves after being given the recipe.

The recipe can be completely executed by the MCP client.

Okay, so this makes LLMs a lot more powerful.

MCP has picked up a lot of popularity now.
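
The sketch below mimics this request/decide/call/respond loop with invented function names and message shapes; the real Model Context Protocol defines its own JSON-RPC messages and SDKs, so this is only the shape of the flow.

```python
# Highly simplified illustration of an MCP-style loop. The function names and
# message shapes here are made up for illustration; the real Model Context
# Protocol defines its own JSON-RPC message formats and SDKs.
def fake_llm(query, tool_results=None):
    if tool_results is None:
        return {"action": "call_tool", "tool": "search_flights", "args": {"route": "BLR-DEL"}}
    cheapest = min(tool_results, key=lambda f: f["price"])
    return {"action": "respond", "text": f"Book {cheapest['flight']} at {cheapest['price']}"}

def fake_airline_server(tool, args):
    # Stand-in for an external MCP server wrapping an airline's systems (made-up data).
    return [{"flight": "6E 1020", "price": 4500}, {"flight": "AI 803", "price": 5200}]

def mcp_client(user_query):
    decision = fake_llm(user_query)                                         # 1. LLM asks for a tool
    if decision["action"] == "call_tool":
        results = fake_airline_server(decision["tool"], decision["args"])   # 2. client calls the server
        decision = fake_llm(user_query, tool_results=results)               # 3. results go back to the LLM
    return decision["text"]

print(mcp_client("Find me the cheapest Bangalore-Delhi flight and book it."))
```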

Okay, so all of this put together is called context engineering.

If you are an engineer, you have probably heard of this term.

And basically this is an encapsulation of many of the things that we have already discussed.

We discussed a few short prompting, which is giving examples.

We discussed retrieval augmented generation, which is getting relevant documents from a vector database.

And using them to add context to a query and using model context protocol to hit external servers.

And perform actions as needed.

When it comes to context engineering, there are two new challenges that we are facing as engineers.

One is user preferences and the second is prompt summarization.

You can call it context summarization.

For example, you might use a sliding window.

Where the last 100 chats are sent directly to the large language model, and all the previous chats are summarized into just five sentences.

This limits the maximum number of chats that you are sending to the large language model.

You could use other techniques also.

For example, some people just focus on keywords.

Some people focus just on the last chat.

So one chat and a summary of the entire previous history go together.

The idea is to get context summarization. This way, when you get a document, you again summarize it first and then send it.

So this can be done maybe using a cheap small language model or a distilled model.

And once you have generated the context, you send that to the expensive large language model.
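
Here is a small sketch of that sliding-window idea, with a placeholder summarize function standing in for the cheap summarization model.

```python
# Sketch of the sliding-window idea from the transcript: keep the most recent
# chats verbatim and squash everything older into a short summary. The
# `summarize` function is a placeholder for a cheap/distilled model call.
def summarize(chats, max_chars=200):
    # Placeholder: a real system would call a small summarization model here.
    return " ".join(chats)[:max_chars] + "..."

def build_context(chat_history, user_preferences, window=100):
    recent = chat_history[-window:]
    older = chat_history[:-window]
    context = []
    if user_preferences:
        context.append("User preferences: " + "; ".join(user_preferences))
    if older:
        context.append("Summary of earlier conversation: " + summarize(older))
    context.extend(recent)
    return "\n".join(context)

history = [f"message {i}" for i in range(250)]
print(build_context(history, ["prefers window seats"], window=100)[:300])
```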

You see, the main difference between context engineering and prompt engineering is prompt engineering is for one single prompt.

It is stateless.

Anytime you ask the large language model to behave in a particular way, the system prompt is going to be the same.

But context engineering evolves as per the user's declared preferences and also the previous chat history, similar to what we discussed earlier, but it is more long term.

Which brings us to the most long-term thing you can come up with in the AI space right now.

Agents.

I've made a detailed video on this, so do check that out.

But at a high level, you have a long running process.

Which is known as an agent.

You can think of this as a server which is getting an API call.

And this has many capabilities.

It can go and query an LLM.

It can also query external systems and other agents to meet the user's requirements.

Let's take an example here.

Let's say your travel agent can look into booking flights, booking hotels and even manage your email when you're away.

When it sees a window of opportunity.

Maybe the flights then are cheap.

It goes ahead and makes the booking according to your preferences.
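
A toy agent loop, with an invented decision policy and tool names, showing the general pattern: the agent repeatedly asks the model what to do next and executes the chosen tool until the goal is met.

```python
# Toy agent loop: a long-running process that repeatedly asks an LLM what to do
# next and executes the chosen tool until the goal is met. The tool names and
# the `fake_llm_decide` policy are invented for illustration.
def fake_llm_decide(goal, state):
    if "booking" in state:
        return ("done", "Flight booked within budget.")
    if "flight_price" not in state:
        return ("check_flight_price", None)
    if state["flight_price"] < state["budget"]:
        return ("book_flight", None)
    return ("done", "No cheap flight yet; will keep watching.")

TOOLS = {
    "check_flight_price": lambda state: state.update(flight_price=4200),
    "book_flight": lambda state: state.update(booking="confirmed"),
}

def run_agent(goal, budget=5000, max_steps=10):
    state = {"budget": budget}
    for _ in range(max_steps):
        action, message = fake_llm_decide(goal, state)
        if action == "done" or action not in TOOLS:
            return message or state
        TOOLS[action](state)  # the agent acts on the user's behalf
    return state

print(run_agent("Book a cheap Bangalore-Delhi flight"))
```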

All of this stuff can be managed by an agent.

And the most hyped term here is reinforcement learning.

It's a way in which you can train models to behave in a particular way.

So, for example, if you give a query a user query to the model, the model can generate two responses response one and response two.

You must have seen this in ChatGPT.

Choose the one which is better.

Okay, so the one which is chosen gets a plus one.

The other one gets a minus one.

What happened effectively is you took a user query.

This entire thing can be mapped to a vector, and the vector is in an n-dimensional space.

So you go to that coordinate and you tell the model: look, after reaching here, you generated further tokens, further vectors.

So that's your path.

You went from here to here to here.

So this was the final point of response.

And now you got a score of plus one.

So this gets a score of plus one.

This also gets a score of plus one: plus one, plus one, plus one, plus one.

There's also discounting that you can do.

But for now let's just keep things simple.

This is a nice path.

You always want to follow this path.

Response two was bad.

You went from this point to this point to this point, and then you deviated.

The next token that you generated after the first three tokens, let's say, was "going", and then you did a comma here and went "but it may be". So response one had tokens one, two, three, four, and response two shared tokens one, two, three but had a different token four. Okay.

This was bad.

It got a score of minus one, which means this area gets a score of minus one.

This also gets a score of minus one, and minus one, and minus one.

For the tokens shared with response one, minus one plus one takes it to zero.

So what you're doing is you have a space where you have negative scores, positive scores and neutral scores.

If you do this enough, then you will end up with a vector space where, given an input query, given a starting point, you will have a negative region where you do not want to go, and a positive region where you definitely want to go.

And the more positive it is, the more you want to go there.

Okay, so maybe you go here.

From here you have another very positive space which is over here.

This is like hill climbing, right?

You're basically trying to optimize on the path that you're taking as a large language model.

The expectation is that the final result will make the end user happy.

Okay.

If the end user experience is good, then the model is trained to make users happy.

That's what reinforcement learning with human feedback is.

Human feedback is telling you whether it is a plus 1 or -1, and the feedback is helping you reinforce good outputs.
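
The sketch below mirrors the +1/-1 intuition from the transcript (real RLHF trains a reward model and updates the policy with an algorithm such as PPO): the preferred response's token path is rewarded, the rejected one is penalized, and shared prefixes cancel to zero.

```python
from collections import defaultdict

# Toy version of the +1 / -1 scoring described above: every token position
# along the preferred response gets +1, every position along the rejected
# response gets -1, so shared prefixes cancel out to zero. Real RLHF trains a
# reward model and updates the policy (e.g. with PPO); this only mirrors the
# intuition in the transcript.
scores = defaultdict(float)

def score_response(prompt, response_tokens, reward):
    prefix = tuple(prompt)
    for token in response_tokens:
        prefix = prefix + (token,)
        scores[prefix] += reward

prompt = ["all", "that", "glitters"]
score_response(prompt, ["is", "not", "gold"], +1.0)    # user preferred this one
score_response(prompt, ["is", "not", "going"], -1.0)   # user rejected this one

for path, value in scores.items():
    print(path[len(prompt):], value)
# ('is',) 0.0   ('is', 'not') 0.0   ('is', 'not', 'gold') 1.0   ('is', 'not', 'going') -1.0
```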

This is an extremely powerful technique.

In fact, it is seen in nature.

If you know about Pavlov's dog, then there was this situation where Pavlov would ring a bell and give food to the dog, which would come over after hearing the bell.

Eventually he realized that if he just rings the bell without giving food, the dog still comes and starts salivating, because it's expecting food.

So its behavior has been reinforced.

Fortunately, this is not the only capability that human beings have.

You cannot model human intelligence using just reinforcement learning.

I'll take an example.

Let's say you have a coin which is giving you heads.

Heads. Heads.

Heads. Heads. Heads.

If you know that this is a fair coin.

If you have a mental understanding of how the coin works, then what do you think is coming next?

Heads or tails?

Okay.

With what? Probability?

Okay, so I just looked at the camera and said, okay, okay.

Twice. Something's going on.

But as a human being, you should look at this and say if it is a fair coin, if it's an unbiased coin, then it can be heads or tails.

You can't guarantee that it is going to be heads next.

But reinforcement learning just looks: it observes the real world and based on that makes a decision.

So when it predicts heads it gets reinforced.

Great job.

When it predicts tails, it gets punished.

Bad job.

But the reality is this is a fair coin.

So there's a 50-50 chance of either.

If you ask a human being, you show them the coin.

You tell them that this is a fair coin, and then you just keep flipping the coin.

You get a lot of heads.

They're just going to say 50-50, because they have an internal representation of how the coin works.

They have a mental model of the physics of the coin.

Reinforcement learning cannot build mental models; it can just tell you, based on outcomes, what is more likely and what is maybe a more beneficial path.

Okay, we are not crocodiles. We are humans.

We have a deeper understanding of how things work.

Having said that, reinforcement learning is a powerful technique.

It does make models get smarter.

Quite smart right?

Chain of thought.

Pretty simple concept, but very powerful.

When training the model, we clearly explain our thought process here.

The expectation is that as the model trains to break a problem step by step, it's going to look at newer problems with different parameters and still be able to reason through them because it has been trained to reason step by step.

This is called chain of thought, where the model goes through a series of deductions or inferences and comes up with the final response.

The quality of this response is usually much higher than a direct response.

As you can see, this is similar to few-shot prompting.

The quality of the response is higher.

It has some examples to go through, but here the key difference is that there is a step by step breakdown, and new steps can be added by the model as it sees fit.

Because it is trained on so much training data, it may be able to reason to add more steps as the problem gets more and more difficult.
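
A small sketch of what a chain-of-thought style prompt can look like; the worked example inside the prompt is made up.

```python
# Sketch of a chain-of-thought prompt: the model is nudged to produce
# intermediate reasoning steps before the final answer. The example problem
# and wording are made up.
def chain_of_thought_prompt(question):
    return (
        "Solve the problem step by step, then give the final answer.\n\n"
        "Q: A shop sells pens at 12 rupees each. How much do 7 pens cost?\n"
        "Step 1: Each pen costs 12 rupees.\n"
        "Step 2: 7 pens cost 7 * 12 = 84 rupees.\n"
        "Answer: 84 rupees.\n\n"
        f"Q: {question}\n"
        "Step 1:"
    )

print(chain_of_thought_prompt("A train covers 60 km in 1 hour. How far does it go in 2.5 hours?"))
```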

Okay.

In fact, this is something that has been seen with DeepSeek.

If you make the problem harder, it goes for more steps.

If you make the problem easy, then it goes for fewer steps.

So this is called a reasoning model.

Okay.

They do not necessarily need to do chain of thought.

They can also use other algorithms. For example there is tree of thought graph of thought also that you can go through.

You can use tools also to come up with better reasoning, but a model that can reason, a model that can figure out, given a problem, how to solve that problem step by step is a reasoning model.

These are also known as LRMs, large reasoning models.

Okay, examples of this are DeepSeek and OpenAI's o1 and o3.

All of these are newer models with new capabilities.

Now, multimodal models. Okay.

So the basic idea is that most large language models that we know of operate on text.

But what about models which can accept and create images, generate images.

What about models which can accept and create videos. Okay.

So they can analyze images.

They can tell you the number of apples in an image, let's say.

Or they can modify an image to create a new image.

Similarly for video, these have tremendous application, similar to how large language models have changed the marketing space when it comes to textual content.

Now, social media is rife with large language model content.

Images are going to get better and better, and video can be a really big deal.

Because if you have celebrities who can create video, if you can create ads through these models, then the cost of creating video is going to go down, okay.

This is already happening to some extent, but the quality of the models is not very good yet.

Multimodal in general means any kind of mode of input data.

It turns out that their performance is better than models which are just trained on text. Okay.

They have a deeper understanding of the meaning of objects.

If you train a model on "cat" and "feline" and so on, and then also show it images of cats, then the performance of the model, the output quality, is usually better.

Okay.

The training is better.

Fine.

Let's get to three major topics, which is where the AI space is heading.

Okay.

People are looking for more company-specific smaller models and foundation models.

The reason for this is companies want more control over what they generate.

They also want to keep the data close to themselves.

They don't want to expose it to any other third party company.

So one of the things which is happening is we are looking at smaller models, or small language models.

As you can expect from the words, these have fewer parameters than large language models.

For example, a small language model may have 3 million to 300 million parameters.

Okay, the neural network internally has fewer connections, fewer weights.

But if you look at large language models, in contrast, you have 3 to 300 billion parameters.

So an LLM is a very large neural network with a lot of weights.

But the SLM is smaller, and it is useful because it is trained on less data, which can be company specific.

Or task specific.

For example, a bot which is trained on just customer queries, how to manage customer queries, how to make sales is likely to perform decently well.

Okay, it's going to be an expert at sales, but it probably can't tell you a detailed weather analysis.

For most companies, this doesn't matter.

In the case of NASA, this is what you need.

You are probably not selling anything openly; or maybe you are, who knows?

But NASA would be more interested in building a foundation model which can predict the weather, but not bothered about the sales part.

So in this way, smaller language models are being trained by companies on their specific data, on the proprietary data to come up with reasonably good responses for specific use cases.

And the process of building small language models is usually distillation.

The basic idea is you have a large language model, which is a teacher, and then you pass in some input.

You look at the output of the large language model, and in parallel you also send it to a small language model.

Okay, with fewer parameters.

And it also tries to predict the output.

So the teacher produces an output and the student tries to mimic the teacher.

If these two outputs match, then the small language model is doing well.

No weights need to change, but if it is not doing well, then the internal weights of the small language model are changed.

But there is a limited number of weights assigned to this model: 3 to 300 million.

What you are basically trying to do is condense this information, the complex neural network, into the most reasonable representation that you can have, such that your performance is okay but the costs are significantly reduced.

So during runtime, during production inference time, when you get a query, the small model is going to be much faster at responding as compared to the large language model.

It's also easier to host.
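
Here is a toy distillation step where both teacher and student are just linear layers with a softmax, purely to show the mechanics of the student matching the teacher's output distribution.

```python
import numpy as np

# Toy distillation step: the student is trained to match the teacher's output
# distribution on the same input. Both "models" here are just linear layers
# followed by softmax, purely to show the teacher/student mechanics.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
teacher_w = rng.normal(size=(8, 50)) * 0.5   # teacher: 8-dim input, 50-token vocab
student_w = np.zeros((8, 50))                # student starts knowing nothing

x = rng.normal(size=8)                       # one input example
teacher_probs = softmax(x @ teacher_w)       # the teacher's "soft" answer

def gap():
    return np.abs(softmax(x @ student_w) - teacher_probs).max()

print("before training:", gap())
for _ in range(500):
    student_probs = softmax(x @ student_w)
    # Gradient of cross-entropy against the teacher's soft targets w.r.t. the
    # student's logits is (student_probs - teacher_probs).
    student_w -= 0.1 * np.outer(x, student_probs - teacher_probs)
print("after training: ", gap())             # much smaller: the student mimics the teacher
```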

Okay.

Distilled models take us to the last term that you really should know if you are an engineer, and that is quantization.

Here the idea is that you have neural networks.

Each of these weights is basically a number, let's say a 32 bit number.

What if you could take these weights and condense that information into eight bits?

Then 75% of your memory is expected to be saved.

It doesn't directly map to a 75% saving here, because quantization is usually applied just to the feedforward neural network weights.

You still have the attention mechanism, and also the training cost is the same because initially you come up with a really good model with zero quantization.

Once the model is completely trained, that's when you apply quantization.

So the training cost does not reduce.

This is mainly to reduce inference cost, the cost of running a model in production.
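
A minimal sketch of 8-bit linear quantization of a weight matrix: store int8 values plus one scale factor and dequantize when needed; real schemes quantize per channel or group and handle zero points.

```python
import numpy as np

# Minimal 8-bit linear quantization of a weight matrix: store int8 values plus
# one scale factor, then dequantize at inference time. Real schemes quantize
# per channel/group and handle zero points; this shows the basic idea of going
# from 32-bit floats to 8-bit integers (~75% less memory for the weights).
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # map the largest weight to +/-127
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print("memory:", weights.nbytes, "->", quantized.nbytes, "bytes")
print("max error:", np.abs(weights - dequantized).max())
```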

So these are the most important 20 terms that I want to discuss in the engineering space.

I think knowing these terms will help you effectively communicate with any other engineer or people in the team.

I couldn't go into enough detail here because, I mean, when you're talking about something like the attention mechanism, you cannot do this in a 20-30 minute video.

But the things you should know about are these terms, and most of them are also covered in the InterviewReady engineering course.

If you know them, then you truly understand how these models work.

And all of the hype and nonsense which is going on in this space becomes recognizable as hype and nonsense to you, right?

You are able to recognize it much better. Thank you for watching.

I hope you enjoyed the video. I'll see you next time. Bye bye.
