
Fine-Tune Visual Language Models (VLMs) - HuggingFace, PyTorch, LoRA, Quantization, TRL

By Uygar Kurt

Summary

Key takeaways

  • **Chat with images using VLMs**: Visual Language Models (VLMs) allow users to interact with images by asking questions and receiving descriptive responses, similar to how one chats with traditional language models. [00:14]
  • **Fine-tuning VLMs with LoRA and Quantization**: To fine-tune large VLMs like Qwen2-VL-7B-Instruct on limited hardware, techniques such as LoRA (Low-Rank Adaptation) and 4-bit quantization are employed to reduce memory usage and computational load. [02:41], [02:59]
  • **Dataset Preparation for VLMs**: The ChartQA dataset, used for fine-tuning, requires specific formatting where each data point is structured as a list of dictionaries containing system prompts, user queries, and assistant answers, along with image data. [08:14], [09:02]
  • **Training and Evaluation Loop**: The training process involves configuring hyperparameters like learning rate, batch size, and evaluation steps, followed by initiating training using an SFT trainer and monitoring the evaluation loss to track model improvement. [34:05], [40:26]
  • **Post-training Inference**: After fine-tuning with LoRA adapters, the model's performance is evaluated through inference, comparing its generated answers to the ground truth, though significant improvements may require training on a larger dataset. [44:21], [44:45]

Topics Covered

  • Optimizing Training for Large Models with Limited GPU Memory
  • Custom Data Formatting is Essential for VLM Trainer Compatibility
  • LoRA Adapters: Efficiently Training Large Models with Frozen Base
  • Small Training Datasets Limit VLM Performance, Despite Fine-Tuning

Full Transcript

Hey people, today we are going to fine-tune visual language models. These visual language models have been very popular, and I thought I should make a video about it and show you how to fine-tune such models. What vision language models do is let you chat with images just as you chat with LLMs: we will upload an image, ask questions about it, and the vision language model will return appropriate responses. As the model of choice I will use Qwen2-VL-7B-Instruct, the reason being that it is a popular model and it is readily available. Before we get started, as usual, make sure you subscribe to my channel, hit that like button, and let me know in the comments below what you want to see next. If you're ready, let's just go.

As the editor I'm using Kaggle notebooks. I find Kaggle pretty good for this: they provide you GPUs, and compared to other options such as Colab they give you much more memory and much more flexibility and freedom, so with such projects I usually just go with Kaggle notebooks. But just use whatever you want.

The first thing is that we are going to install some dependencies, which are bitsandbytes, PEFT, and TRL. What you can do is just put an exclamation mark, pip install, and these libraries, and run the cell. After the run is complete, one important detail: if you just continue, it may later give you an error saying that bitsandbytes is not installed, even though we installed it here. To avoid such an error, all you need to do is simple: just restart your notebook, and that's all.
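
A minimal sketch of that install cell (package names as described in the video; pin versions if you need reproducibility):

```python
# Install the fine-tuning dependencies inside the notebook.
# If bitsandbytes later appears "not installed", restart the kernel once and continue.
!pip install -q bitsandbytes peft trl
```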

Now let's continue with importing the necessary libraries. The first thing I'm going to do is import the operating system module and disable Weights & Biases. It comes turned on by default, but in this case I don't want to use it, and to disable it you just have to set the corresponding environment variable to "true". We will need load_dataset from datasets, because we will load datasets from Hugging Face. We will use PyTorch, so we import torch. From Transformers we import Qwen2VLForConditionalGeneration; this is the class for Qwen, and if you want to use a different vision language model you will probably need to change this according to your model. We also have to import Qwen2VLProcessor. We will use quantization to fit this model into our little GPU, so for this we have to import BitsAndBytesConfig. This is a large model, and especially if you are dealing with images you need very powerful GPUs; if you don't have that, one of the options is to use adapters such as LoRA, which is what we are going to do now. For this we import LoraConfig and get_peft_model from PEFT; this will let us put LoRA adapters into our model. We also have to import our trainer and its config, so we import SFTConfig and SFTTrainer from TRL. Lastly, you may encounter some annoying warnings. They annoy me, so what I do is import warnings and call warnings.filterwarnings("ignore"); this filters at least some of them. Let's just run these imports.
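
Put together, the import cell looks roughly like this sketch of what is described above:

```python
import os
import warnings

import torch
from datasets import load_dataset
from transformers import (
    Qwen2VLForConditionalGeneration,
    Qwen2VLProcessor,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

# Disable Weights & Biases logging (it is on by default).
os.environ["WANDB_DISABLED"] = "true"

# Silence at least some of the noisier warnings.
warnings.filterwarnings("ignore")
```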

Now that the necessary libraries are imported, let's define some hyperparameters that we are going to use throughout the code. Let's start by setting our device: it is "cuda" if torch.cuda.is_available(), else it is "cpu", and we print which device we are using. Our model is Qwen2-VL-7B-Instruct; this is the Hugging Face path that I showed in the introduction. The number of epochs is one. I set it to one because this is a big model and we have a small GPU for it, but if you have a stronger GPU you can set it to maybe two or even three. Batch size, again in my case it is one because I have a small GPU; if you have a stronger GPU, don't hesitate to set it to a higher number.

Gradient checkpointing is set to True. In standard backpropagation we store the intermediate activations, which speeds up the process, but since we store them they take up GPU memory. With gradient checkpointing, instead of storing them we recalculate them; this increases the computation time but frees up some GPU memory for us, so I set it to True. use_reentrant was just giving a warning with gradient checkpointing, so I set it to False. As the optimizer I'm going to use paged AdamW 32-bit. The learning rate I set to 2e-5; remember that these are hyperparameters, so you can just play with them.

Logging steps decides how often the training and validation loss get logged. In my case I set it to 50 because, as you will see, we will have around 280 steps in total, so 50 was giving me enough information about the training process. Evaluation steps I also set to 50, so every 50 steps it will evaluate the model on the validation dataset and return a loss value. Save steps is 50, so every 50 steps it will save the model. For the evaluation strategy we want to evaluate based on steps, so we set it to "steps", and we also want to save according to steps, so we set the save strategy to "steps".

In the end we want to pick the best model across the steps. But how do we decide the best model throughout the training? It doesn't have to be the last one: maybe before training finished we had a better model than the latest model. We decide it using the evaluation loss, so I set metric_for_best_model to the evaluation loss, and at the end I want to load the best model chosen with that metric, so the trainer will return the best model based on the evaluation loss. I set the max gradient norm to 1; you can play with this. I haven't done any warm-up; again, it is up to you to do some.

By default, the trainer we are going to use does some dataset preparation, but with visual language models we don't want that because we will do it ourselves, so in the dataset kwargs we pass skip_prepare_dataset as True in a dictionary. We also pass remove_unused_columns equals False; that's just a trainer thing. Next, the max sequence length: this is the maximum length of the text the model will produce. I set it to 228, but this is also up to you and you can change it. The number of steps is calculated as the length of the dataset divided by the batch size, multiplied by the number of epochs. I know the dataset has 283 data points, so I just hard-coded that here, and it gets divided by the batch size and multiplied by the epochs. I find it good practice to know how many steps your training will take; depending on the number of points in your dataset you can recalculate it by changing this 283. Finally, I print the number of steps. Let's run the cell: our hyperparameter definitions are complete, we are using CUDA, and the total number of steps is 283, since with a batch size of one and a single epoch it is just the number of data points.
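
As a reference, the hyperparameter cell could look like this sketch (the variable names are my own; the values follow the video):

```python
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

EPOCHS = 1
BATCH_SIZE = 1
GRADIENT_CHECKPOINTING = True          # recompute activations to save GPU memory
USE_REENTRANT = False                  # silences a gradient-checkpointing warning
OPTIM = "paged_adamw_32bit"
LEARNING_RATE = 2e-5
LOGGING_STEPS = 50
EVAL_STEPS = 50
SAVE_STEPS = 50
EVAL_STRATEGY = "steps"
SAVE_STRATEGY = "steps"
METRIC_FOR_BEST_MODEL = "eval_loss"
LOAD_BEST_MODEL_AT_END = True
MAX_GRAD_NORM = 1.0
WARMUP_STEPS = 0
DATASET_KWARGS = {"skip_prepare_dataset": True}  # we prepare the data ourselves
REMOVE_UNUSED_COLUMNS = False
MAX_SEQ_LEN = 228                      # max generated sequence length

TRAIN_SIZE = 283                       # number of points in the 1% train split
NUM_STEPS = (TRAIN_SIZE // BATCH_SIZE) * EPOCHS
print(f"Total training steps: {NUM_STEPS}")
```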

Our data will need to have a format that is compatible with visual language models and the SFT trainer, so for every data point in our dataset we will convert that point into the format we want. For this, we first define a system message. What I say is: you are a highly advanced vision language model specialized in analyzing, describing, and interpreting visual data. I wrote this because we will be working on analyzing, describing, and interpreting visual data; your task is to... and so on.

Then we have a function called format_data, which formats the data as we want. In this case I want each data point to be a list of dictionaries. One dictionary has the role "system", and as content it has the type "text" with the system message. The second dictionary contains the user data: the role is "user", and the content is first the image fed by the user, so the type is "image" and the image is the image field of the sample we pass in (we will look at how this sample looks; for now, just assume that the "image" key gives us the actual image), and then of course the text, which is the query provided by the user, so its type is "text" and it is the query field of the sample. Lastly, we need the answer from our assistant: the role is "assistant", the content has type "text", and the text is the sample's label at index zero. Like I said, we will go over this function a little bit more, but for now let's leave it like this and run it.
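
Here is a sketch of that cell, following the structure described above (the system prompt is abbreviated):

```python
system_message = (
    "You are a highly advanced Vision Language Model, specialized in analyzing, "
    "describing, and interpreting visual data."
)

def format_data(sample):
    """Turn one ChartQA row into the chat format the SFT trainer expects."""
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": sample["image"]},   # the chart image
                {"type": "text", "text": sample["query"]},     # the user question
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["label"][0]}],  # ground-truth answer
        },
    ]
```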

Now let's load our dataset. To do this we just call load_dataset, and the dataset we are going to use is ChartQA, a question-answer dataset based on visual data such as charts. As splits we have train, validation, and test. Again, I have a GPU bottleneck, so I will use only 1% of the dataset; you can do this by specifying 1% for each split. Now let's load the dataset and start to see what is actually inside.
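
A sketch of that loading cell; the dataset path on the Hub is an assumption (the video only says "ChartQA"), so adjust it to the copy you actually use:

```python
# Load 1% of each ChartQA split to keep the run small on a single GPU.
train_dataset, eval_dataset, test_dataset = load_dataset(
    "HuggingFaceM4/ChartQA",                       # assumed Hub path for ChartQA
    split=["train[:1%]", "val[:1%]", "test[:1%]"],
)

print(len(train_dataset))   # 283 data points in the 1% train split
print(train_dataset)        # columns: image, query, label, human_or_machine
```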

The dataset is loaded. I like to begin with printing the length of the dataset, so let's see how many data points we have: it is 283. If you remember, I already put this number into the step calculation above; this is where the 283 comes from, the length of your data. Next, let's see how the dataset actually looks. It has four columns: image, query, label, and another one, human_or_machine, which we won't use. Let's also take a look at the test dataset and print it: we have 25 data points with the same columns. Let's print one of the rows, the zeroth data point, which we can get with the dataset at index zero. We have the image column, which contains the image file; we have the query, which is the input that comes from the user, "How many food items are shown in the bar graph?"; we have the label column as the answer, which in this case is 14; and we have the human_or_machine column, which we won't use.

Let's actually take a look at this image. We can do this by taking the zeroth element of the test dataset and accessing the image. This is how the image looks: a long-term price index of food commodities. Like I said, we will be working with data in image format. The question is how many food items are shown in the bar graph; if you count them, 1, 2, 3, and so on up to 14, the answer, which is the label, is 14.

Before moving on, let's make these prints a bit more organized. I will print this information right after we load the dataset so it is all visible in one place: the length of the train dataset, the train dataset itself, and a sample point at index zero. Let's rerun it; as you can see, everything is shown nicely here: the number of data points in train, how the dataset looks, and a sample from it. Now let's remove the scratch cells, since we understand more or less how this data looks.

Let's start to format the data using the format_data function we defined above. We will pass every single data point through it: we redefine our train dataset by applying format_data to every sample in the train dataset with a list comprehension, which puts every data point, exactly as we just printed it, into the function. Coming back to the function: the sample is a data point in exactly that format, so accessing its "image" key gives us the image file, which is why we write sample["image"]; the same with the query, where the key is "query" and the value is the actual question, so we access it with sample["query"]; and the response of the assistant, the actual answer, is inside the data point under the "label" key, and since it is a list with a single value we index it with zero. This is how we format our data, and it is important because, remember, we said at the beginning that we would format and prepare the data for the trainer ourselves; we passed skip_prepare_dataset as True. We do the same for the evaluation and test sets as well, and let's move these lines up next to the dataset loading, because I think it is more organized like that, and let's just run the cell.
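
That formatting step is just a list comprehension per split, roughly:

```python
# Apply the chat formatting to every sample of every split.
train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset  = [format_data(sample) for sample in eval_dataset]
test_dataset  = [format_data(sample) for sample in test_dataset]
```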

OK, now we have train, evaluation, and test datasets with the formatting applied. Let's check how it looks: we expect the length to be unchanged, and indeed it is still 283.

Now let's see what is actually inside each data point. Again, we print the zeroth element of the train dataset, and now each data point looks like the format we decided on instead of the raw row: as you can see, it is a list of dictionaries. The role "system" has our system prompt as content, "You are a highly advanced vision language model..."; then comes our image file, then our query, and then our answer. So we can say that our dataset preparation worked well. Let's also look at the test set, because we will test most things on it: its length is 25, and accessing a sample shows the same structure again, the system prompt, our image, our query "How many food items are shown in the bar graph?", and the answer, which is 14. Again, let's put these prints into one cell so it looks more organized: the length of the train dataset and a sample from it at index zero (I don't print the whole dataset because it is a list and far too long), and the same for the test dataset, its length and the zeroth sample. What I'm trying to do is compare these datasets before and after the formatting. Let's run this again: this is how it looks before, and this is how it looks after the formatting. We can delete the scratch cells.

There are a couple more things I want to do with the dataset before moving on. First, I want to take a sample data point so that we can test the model before training, so I create a variable called sample_data and just take the zeroth element of the test dataset, which is the sample we just looked at. I also want the question, the query in it, which is "How many food items are shown in the bar graph?", so I create a variable called sample_question from that same test sample. If you look at the printout, this question lives in the dictionary at index one, under the key "content", and within that list it is the element at index one, because the element at index zero is the chart itself; the key for the question is "text". This should give us the question; let's test it by printing sample_question, and yes, it gives us "How many food items are shown in the bar graph?", so it is correct and we can remove the test print.

Now I want the sample answer, which is 14. To do that I follow something similar: sample_answer is taken from the zeroth test sample; the printout is a bit crammed, but we know the format, so the answer is in the dictionary at index two, under the key "content", which only has one element anyway, so we access element zero and then the key "text". Let's run this and check it by printing sample_answer: it is 14, so that's also fine. Next I want the image itself: again something similar, sample_image comes from the zeroth test sample; going back up, the image is in the dictionary at index one, under the key "content", and in that list it is the element at index zero, a dictionary with the key "image", so we put index zero and the key "image". Let's run the cell and show the image: yes, we accessed the image correctly, because this is our chart. Now let's put everything into one cell, remove the scratch prints, print the sample question and the sample answer, show the image, and run it.
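
In code, pulling those pieces out of the formatted sample looks roughly like this:

```python
sample_data = test_dataset[0]

# Index 1 is the user turn; content[1] is the text part (content[0] is the image).
sample_question = sample_data[1]["content"][1]["text"]

# Index 2 is the assistant turn; its content has a single text element.
sample_answer = sample_data[2]["content"][0]["text"]

# The image sits in the user turn, content[0], under the "image" key.
sample_image = sample_data[1]["content"][0]["image"]

print(sample_question)  # "How many food items are shown in the bar graph?"
print(sample_answer)    # "14"
sample_image            # displays the chart in a notebook
```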

As you can see, the question is how many food items are shown in the bar graph, the answer is 14, and this is our chart. This is actually the problem we are solving: we want to ask a question such as how many food items are shown in the bar graph, and we will try to teach the model that the answer is 14. Now we are more or less done with the dataset preparation, and we can move on to loading the model.

done with the data set preparation we

can move on to loading the model so

sometimes when I use for debugging I

just switch to CPU and after that I

switch to GPU and I don't want to change

my code every time so when loading the

model I usually use an IFL statement to

make the code more robust in my case for

this I just check if device is chuda

which means that if we are using a GPU

first we are going to Define our

quantization for this we will use bits

and bytes config we will pass the

parameters such as loading forbit use

double Quant Etc I won't go into the

details of what these parameters are

it's out of the scope but just know that

this is a pretty good configuration for

quantization you can just use it now we

are loading the model we are going to

use qu 2 VL for conditional generation

that from pre-trained our model ID we

already defined it device map we want

the devices to maap automatically

quantization config we just defined it

is the bmbb config so use cachy by

default is true so what it says is that

whether to use qwi cachy or not when you

store the keys and values however you

use some additional gpus like we

mentioned in this case GPU is a bit of

bottleneck and training is already

resource consuming so not to get some

cud error I just set this to false and

that's all for the model loading if we

are not in Cuda it means most probably

when we are on CPU again we will do

something similar but we will not pass

the quantization configuration

we just pass model ID use cachy is false

and that's all after the model loading

is complete we have to load our

processor to do this we simply do CR to

real processor. from pre-trained again

we pass the model ID in this case we set

the tokenizer padding size as right for

auto regressive models it is a good

practice to set the padding side to

right so we can just run the cell and it

will load our model as well as our
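
A sketch of that loading cell; the specific 4-bit settings are the usual NF4 recipe and are my assumption, since the video doesn't read them out:

```python
if DEVICE == "cuda":
    # 4-bit NF4 quantization so the 7B model fits on a small GPU.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        device_map="auto",
        quantization_config=bnb_config,
        use_cache=False,   # KV cache off during training to save memory
    )
else:
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        use_cache=False,
    )

processor = Qwen2VLProcessor.from_pretrained(MODEL_ID)
processor.tokenizer.padding_side = "right"   # right padding for autoregressive training
```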

Now that the model has been loaded, let's do a sample text generation, a sample inference, and see how this model actually generates text. For this we will create a function called text_generator and pass a sample data point to it. What sample_data is, we already defined above: it is just a point from the dataset, in this case the zeroth element of the test dataset. First, let's remind ourselves how the sample data looks. Like in any text generation, the thing we don't want to include is the answer, because this "14" is exactly what we want the model to figure out, and the answer sits at index two of the list, in the content of the assistant turn. With this in mind, let's start and remove these scratch cells.

First we take our processor and call apply_chat_template; this applies the chat template to our sample data and is a very handy function. However, if we pass sample_data directly, it will also include the answer, which we just said we don't want, and the answer is at index two, so we simply slice with 0:2, which keeps the result out of our input text. I don't want it to tokenize, so I pass tokenize=False, and I do want to add the generation prompt, so I pass add_generation_prompt=True. Let's see how the prompt looks now: I print the prompt, which is the text with the chat template applied, and call text_generator on the sample data. As you can see, the chat template has been applied: the system part is the system prompt, "You are a highly advanced vision language model", which we set at the beginning, and the user message is "How many food items are shown in the bar graph?". The assistant part we do not include; we expect the model to generate it.

So let's get the model to generate it. First, obviously, we need the image; we will call it image_inputs, and it is located where we already accessed it for sample_image: index one, key "content", index zero, key "image" of the sample data. Let's print image_inputs to be sure: yes, it is our image file, so we can remove the print. To prepare these inputs for the model we use the processor: inputs, which will be the model input, equals the processor called with the text and the image inputs, with return_tensors set to PyTorch tensors, and since we are working on a GPU or a CPU we move the inputs to whichever device we are on. Now the inputs are ready and it's time to generate the text: we call model.generate, unpack the inputs into it, and set how many tokens to generate with max_new_tokens, which we already set as a hyperparameter, the max sequence length. Now the texts are generated; to turn them into something readable we have to decode them with the processor, so output_text is processor.batch_decode with the generated IDs as input, and I skip special tokens so we get a clean, easy-to-read output. To free up some memory, since we already have the generated text, I delete the inputs. I also want to return the actual answer to compare against the generated output; if you go up, the actual answer lives at the location we used for sample_answer, so we take it from the sample data and return both. output_text is a list with one element, so we return its zeroth element along with the actual answer. Since we now return two things, I modify the call to unpack them into generated_text and actual_answer and print them both: "generated text is ..." and "actual answer is ...". Let's run it and compare.
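
Put together, the inference helper looks roughly like this sketch:

```python
def text_generator(sample_data):
    # Build the prompt without the assistant turn (index 2 holds the answer).
    text = processor.apply_chat_template(
        sample_data[0:2], tokenize=False, add_generation_prompt=True
    )

    # The image lives in the user turn: index 1 -> "content" -> index 0 -> "image".
    image_inputs = sample_data[1]["content"][0]["image"]

    inputs = processor(
        text=[text], images=[image_inputs], return_tensors="pt"
    ).to(DEVICE)

    generated_ids = model.generate(**inputs, max_new_tokens=MAX_SEQ_LEN)
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)

    del inputs  # free a little GPU memory

    actual_answer = sample_data[2]["content"][0]["text"]
    return output_text[0], actual_answer


generated_text, actual_answer = text_generator(sample_data)
print(f"Generated text: {generated_text}")
print(f"Actual answer: {actual_answer}")
```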

The generated answer says that the bar graph shows the long-term price index for 11 different food commodities, then lists 12 commodities, and finally says there are 12 food items shown in the bar graph, while the actual answer is supposed to be 14. So, as we can observe, the answer is both inconsistent and incorrect. Our hope with the training is to correct that and make the model capable of understanding data from images.

Now we are slowly moving to the training part. Before that, we first have to configure our LoRA adapters. To do that we create a variable called peft_config, which is a LoraConfig with lora_alpha 16, dropout 0.1, r 8, and so on; again, I won't go into the details of these parameters, but this is just a good configuration. What we expect after putting the adapters into our model is that its number of parameters will slightly increase, and also, since we will only train the adapters and freeze the rest of the model, we expect the trainable parameters to be a small number compared to the total number of parameters. To do that comparison, first I print the number of parameters before adding the LoRA adapters, which we can get from the model's parameter count. After that we apply the PEFT config to our model with get_peft_model; it converts our model into the LoRA adapter model, which we call peft_model here, and we print its trainable parameters with print_trainable_parameters. Let's run the cell: this is the number of parameters our model has before the adapters, and this is after the adapters. As you can see there is a slight increase, so the adapters are getting mounted, and the trainable parameter count is very low, since we are only training the adapters.
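
A sketch of that LoRA cell; the target_modules list is my assumption, since the video doesn't spell out the full config:

```python
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],   # assumption: adapt the attention projections
    task_type="CAUSAL_LM",
)

print(f"Parameters before adapters: {model.num_parameters():,}")

peft_model = get_peft_model(model, peft_config)

print(f"Parameters after adapters:  {peft_model.num_parameters():,}")
peft_model.print_trainable_parameters()   # only the adapter weights are trainable
```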

Now that we have the LoRA configuration, let's move on and set our training arguments. For this we say training_args equals SFTConfig, into which we put our configuration. The output directory is where your checkpoints and models will be stored; I just named it "output". Since we defined everything in the hyperparameter section at the beginning, this part is very straightforward: the number of training epochs is the epoch count we set, and the same goes for the per-device train batch size and the evaluation batch size, which we set as the batch size. Gradient checkpointing we decided to set to True in the hyperparameter section, so we keep it; likewise the learning rate, the logging steps, the evaluation steps and evaluation strategy, and we already explained what they do, so I won't go over them again; then the save strategy, save steps, metric for best model, load best model at end, max gradient norm, warmup steps, dataset kwargs, max sequence length, remove unused columns, and the optimizer. Like I said, we already decided all of them and explained what they do in the hyperparameter section, so this part is pretty straightforward. Let's run our configuration.
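
A sketch of the SFTConfig cell, wired to the hyperparameter names from the sketch above (argument names can shift slightly between TRL/transformers versions, e.g. eval_strategy vs. evaluation_strategy):

```python
training_args = SFTConfig(
    output_dir="output",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_checkpointing=GRADIENT_CHECKPOINTING,
    gradient_checkpointing_kwargs={"use_reentrant": USE_REENTRANT},
    learning_rate=LEARNING_RATE,
    logging_steps=LOGGING_STEPS,
    eval_steps=EVAL_STEPS,
    eval_strategy=EVAL_STRATEGY,
    save_strategy=SAVE_STRATEGY,
    save_steps=SAVE_STEPS,
    metric_for_best_model=METRIC_FOR_BEST_MODEL,
    load_best_model_at_end=LOAD_BEST_MODEL_AT_END,
    max_grad_norm=MAX_GRAD_NORM,
    warmup_steps=WARMUP_STEPS,
    dataset_kwargs=DATASET_KWARGS,          # {"skip_prepare_dataset": True}
    max_seq_length=MAX_SEQ_LEN,
    remove_unused_columns=REMOVE_UNUSED_COLUMNS,
    optim=OPTIM,
)
```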

Another thing we need to do: since we set skip_prepare_dataset in the dataset kwargs, we will do the data preparation ourselves, so we need a function that actually does it and we will pass it to the trainer. Since we are using the SFT trainer, everything is supposed to be in the format it expects; that's the drawback. It makes everything easier, but it doesn't give you that much flexibility and you cannot really see what is going on under the hood. In the end, this function has to return a dictionary containing the input IDs, the attention mask, the labels, and the pixel values.

Let's start by creating some sample data that we will pass to this function; let's call it collate_sample and just put two data points in it, the zeroth and first elements of the train dataset. Here I am assuming a batch size of two: even though we are training with a batch size of one, I'm showing it with two so everyone can understand and reuse it. Then we define our function, collate_fn. It is pretty similar to what we did previously: we apply the chat template to every example, just as we did for the text generation, but for all of the examples, so we use a list comprehension over the examples. Compared to text generation, I didn't take only the first two entries, because since this is training we also want to give the actual answers to the model. Then image_inputs: this is how we accessed the image in the text generation part, again as a list comprehension over the examples. Now we start preparing batches: we call our processor with the texts as text and the image inputs as images, return PyTorch tensors, and additionally, since there can be batches, we pass padding=True. Up to here it is more or less the same as the text generation.

Additionally, we want to give the trainer the data in the format it wants. That's why we create labels by cloning the input IDs from our batch, and for every pad token we set the label to -100 so it is ignored in the loss; then we add a key called "labels" with those label IDs and return the batch. Now, to test our function, we just call collate_fn on our sample data, call the result collated_data, and print its keys. As I mentioned at the beginning, the trainer expects input_ids, attention_mask, pixel_values, and labels; the labels we added manually, but the rest has to be there. And as we can see, we have input_ids, attention_mask, pixel_values, one additional key that doesn't matter, and also the labels, so the collate_fn function is good to go.
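
Here is a sketch of that collator, following the steps described above (the pad-token masking uses the tokenizer's pad_token_id):

```python
def collate_fn(examples):
    """Turn a list of formatted chat samples into a padded training batch."""
    # Keep the full conversation, including the assistant answer, for training.
    texts = [
        processor.apply_chat_template(example, tokenize=False)
        for example in examples
    ]
    # Image sits in the user turn: index 1 -> "content" -> index 0 -> "image".
    image_inputs = [example[1]["content"][0]["image"] for example in examples]

    batch = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )

    # Labels are the input IDs with pad tokens masked out of the loss.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch


# Quick sanity check with a pretend batch of two samples.
collate_sample = [train_dataset[0], train_dataset[1]]
collated_data = collate_fn(collate_sample)
print(collated_data.keys())   # input_ids, attention_mask, pixel_values, ..., labels
```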

Now we create the trainer that will do the training. For this we say trainer equals SFTTrainer; again, it is a straightforward process. We pass our model, and notice that we pass the regular model, not the PEFT model we created above, because the trainer will apply the adapter by itself. We pass the training arguments we defined above, our training dataset as well as the evaluation dataset, our data collator, which is the collate_fn we just defined, our PEFT config, which is the LoRA configuration, and for the processing class we pass the tokenizer of our processor. And that's it, let's run this.
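
A sketch of the trainer cell (in older TRL versions the last argument is tokenizer= rather than processing_class=):

```python
trainer = SFTTrainer(
    model=model,                       # base model; the trainer mounts the LoRA adapter itself
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)
```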

Now it is pretty simple to start the training. However, to get an initial evaluation of the model without any training, I will add one more step: an initial evaluation, done by calling trainer.evaluate(), which returns the evaluation metrics for the untrained model, and I print these metrics to compare with every step that gets logged. After that we start the training: I just print "Training" and call trainer.train(), and it will start the training. Let's actually start it; it will take some time depending on your GPU, even though we are doing a small training run, so after this is complete, see you right here.

see you right here now our training is

complete you should have a table like

this so let's go over it initially we

want to look at evaluation loss

initially our evaluation loss was

1366 then at the 50th step it became

13.16 then 12 then 10 then 8.6 then 8 so

we can confirm that we did the training

and the model has improved but since it

was a very small training with a small

data set the model wouldn't change

drastically if you want to create a more

powerful model I would suggest to do a

full data set training okay since I

don't have that GPU I did this training

just to show how it is done now this is

is done we can save our model our best

model like this trainer. save model and

we want to save it to our output

directory in the training arguments

which is just output since we set if you

go about we set load best model at end

this will directly save the best model

out here in this case it is the last

model in the last step but sometimes

what can happen is that at one point you

have the best model but with more

training your loss starts to go up or

something similar in those cases this

will come handy we can just save our

model like this now let's do some

inference and let's test our new model

but first we want to clear out some

memory for this we will use a Cod

snippet like this so I took this from

here which I will also put it as a

reference it is also a VM fine tuning

block which I suggest you check it out

too so this is a good script to clear

out all memory so we will just run this

and now we freed up some gpus from the
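
The exact snippet comes from the referenced blog post; the usual pattern looks something like this sketch:

```python
import gc
import time

# Drop references to the large objects, then force garbage collection
# and release cached CUDA memory.
del model
del trainer
gc.collect()
time.sleep(2)          # give the allocator a moment before measuring
torch.cuda.empty_cache()
gc.collect()

print(f"GPU allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
```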

To do the inference, of course, we need to load our model again. To do that you can just go up, copy that model-loading cell, and paste it again, because we are using the same thing; however, since we are not training but doing inference, we can use the KV cache, so I set use_cache to True in both branches. Let's load the base model again. Now that the model loading is complete, notice that at this point we loaded the regular model, so as the next step, since we did a LoRA training with adapters, we want to mount our adapters on top of the base model. We do that by calling model.load_adapter with "output", which is where we saved our model as well as the adapters, so load_adapter will load the adapter and we will have our freshly trained model. Let's also do our regular check of the number of parameters and the LoRA adapters: print the number of parameters the model has before the adapters, and also check the number of parameters after we have loaded them. As you can see, the model's parameter count has increased, because we loaded the LoRA adapter.
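
Sketched out, reloading the base model and mounting the adapter looks like this:

```python
# Reload the quantized base model, this time with the KV cache enabled for inference.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config,
    use_cache=True,
)
processor = Qwen2VLProcessor.from_pretrained(MODEL_ID)

print(f"Parameters before adapter: {model.num_parameters():,}")

# Mount the LoRA adapter saved by trainer.save_model("output").
model.load_adapter("output")

print(f"Parameters after adapter:  {model.num_parameters():,}")
```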

Now let's actually do a sample text generation. We already have a function for that above, and we already generated text before training, so let's go there and copy those three lines; we already have the function and the sample data. Let's print the result, run it, and let our new model generate some text. Looking at the generated answer: this is our question, and the assistant says there are 11 food items shown in the bar graph. The actual answer is 14, so it got it wrong again. However, if we compare with the previous answer, which I'll copy and paste here, at least it is more consistent: it didn't list 12 commodities while claiming there are 11 different food commodities. So there is some improvement; of course it is just not there yet, which is pretty much expected with such a small training run, but if you continue and do the training with the full data instead of 1% of it, it will get there.

So this is how you train a visual language model; in this case we fine-tuned the Qwen2-VL-7B visual language model. Thank you for watching, I hope you learned something new and enjoyed doing these experiments. Don't forget to like the video, subscribe to my channel, comment what you want to see next, and until next time, bye-bye.
