Fine-Tune Visual Language Models (VLMs) - HuggingFace, PyTorch, LoRA, Quantization, TRL
By Uygar Kurt
Summary
## Key takeaways
- **Chat with images using VLMs**: Visual Language Models (VLMs) allow users to interact with images by asking questions and receiving descriptive responses, similar to how one chats with traditional language models. [00:14]
- **Fine-tuning VLMs with LoRA and quantization**: To fine-tune large VLMs like Qwen2-VL-7B-Instruct on limited hardware, techniques such as LoRA (Low-Rank Adaptation) and 4-bit quantization are employed to reduce memory usage and computational load. [02:41], [02:59]
- **Dataset preparation for VLMs**: The ChartQA dataset, used for fine-tuning, requires specific formatting where each data point is structured as a list of dictionaries containing system prompts, user queries, and assistant answers, along with image data. [08:14], [09:02]
- **Training and evaluation loop**: The training process involves configuring hyperparameters like learning rate, batch size, and evaluation steps, then launching training with an SFT trainer and monitoring the evaluation loss to track model improvement. [34:05], [40:26]
- **Post-training inference**: After fine-tuning with LoRA adapters, the model's performance is evaluated through inference, comparing its generated answers to the ground truth, though significant improvements may require training on a larger dataset. [44:21], [44:45]
Topics Covered
- Optimizing Training for Large Models with Limited GPU Memory
- Custom Data Formatting is Essential for VLM Trainer Compatibility
- LoRA Adapters: Efficiently Training Large Models with Frozen Base
- Small Training Datasets Limit VLM Performance, Despite Fine-Tuning
Full Transcript
Hey people, today we are going to fine-tune visual language models. These visual language models have been very popular, and I thought I should make a video about it and show you how to fine-tune such models. What vision language models do is let you chat with images just as you chat with LLMs: we will upload an image, ask questions about it, and the vision language model will return appropriate responses. As the model choice I will use Qwen2-VL-7B-Instruct, the reason being that it is a popular model and it is available on Hugging Face. Before we get started, as usual, make sure you subscribe to my channel, hit that like button, and let me know in the comments below what you want to see next. If you're ready, let's just go.

As an editor I'm using Kaggle notebooks. I find Kaggle pretty successful for this: they provide you GPUs, and compared to other options such as Colab they give you much higher memory and more flexibility and freedom, so for projects like this I usually just go with Kaggle notebooks. But just use whatever you want.

The first thing is we are going to install some dependencies, which are bitsandbytes, PEFT, and TRL. What you can do is just put an exclamation mark, pip install, and these libraries. Let's just run this.
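A minimal sketch of that install cell, assuming the Kaggle image already ships transformers and datasets (add them to the same line if yours does not):

```python
# Install the extra dependencies used in this notebook (quantization, LoRA adapters, SFT trainer)
!pip install -q bitsandbytes peft trl
```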
After the run is complete, one important detail: if you just continue, it may later give you an error saying that bitsandbytes is not installed, even though we installed it here. To avoid that error, all you need to do is simple: just restart your notebook, and that's all.
Now let's continue with importing the necessary libraries. The first thing I'm going to do is import os and disable Weights & Biases; it comes turned on by default, but in this case I don't want to use it, and to disable it you just set the corresponding environment variable to true. We will need load_dataset from datasets, because we will load datasets from Hugging Face. We will use PyTorch, so we import torch. From transformers we import Qwen2VLForConditionalGeneration; this is specific to Qwen, so if you want to use a different vision language model you will probably need to change this according to your model. We also have to import Qwen2VLProcessor. We will use quantization to fit this model into our little GPU, so for this we import BitsAndBytesConfig. This is a large model, and especially if you are dealing with images you need very powerful GPUs; if you don't have that, one option is to use adapters such as LoRA, which is what we are going to do, so we import LoraConfig and get_peft_model from peft — this will let us put LoRA adapters into our model. We also have to import our trainer and its config, so we import SFTConfig and SFTTrainer from TRL. Lastly, you may encounter some annoying warnings (they annoy me), so I just import warnings and call warnings.filterwarnings("ignore"), which filters at least some of them. Let's just run these imports.
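A sketch of that imports cell, assuming the Qwen2-VL class names used throughout the rest of the video:

```python
import os
os.environ["WANDB_DISABLED"] = "true"  # turn off Weights & Biases logging

import torch
from datasets import load_dataset
from transformers import (
    Qwen2VLForConditionalGeneration,  # swap this class if you fine-tune a different VLM
    Qwen2VLProcessor,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model
from trl import SFTConfig, SFTTrainer

import warnings
warnings.filterwarnings("ignore")  # silence at least some of the noisy warnings
```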
Now that we've imported the necessary libraries, let's define some hyperparameters that we will use throughout the code. Let's start by setting our device: it is cuda if torch.cuda.is_available(), else cpu, and we print which device we are using. Our model is Qwen2-VL-7B-Instruct — this is the Hugging Face path I showed in the introduction. Number of epochs: I set it to one, because this is a big model and we have a small GPU for it, but if you have a stronger GPU you can set it to maybe two or even three. Batch size: again, in my case it is one because I have a small GPU; if you have a stronger GPU, don't hesitate to set it higher. Gradient checkpointing I set to True. What it does: in standard backpropagation we store the intermediate activations, which speeds up the process, but since we store them they take up GPU memory; with gradient checkpointing, instead of storing them we recalculate them, which increases computation time but frees up GPU memory. use_reentrant was just giving a warning with gradient checkpointing, so I set it to False. As the optimizer I'm going to use paged AdamW 32-bit. Learning rate I set to 2e-5; remember these are hyperparameters, so you can just play with them. Logging steps decides every how many steps the training and validation loss get logged; I set it to 50 because, as you will see, we will have around 283 steps in total, so 50 gives me enough information about the training process. Evaluation steps I also set to 50, so every 50 steps it evaluates the model on the validation dataset and returns a loss value. Save steps is 50, so every 50 steps it saves the model. For the evaluation strategy we want to evaluate based on steps, so we set it to "steps"; we also want to save according to steps, so the save strategy is "steps" too. In the end we want to pick the best model across the steps — it doesn't have to be the last model; maybe before training finished we had a better model than the latest one. How do we decide that? We use the evaluation loss, so I set the metric for the best model to the eval loss, and I load the best model at the end, chosen by that metric; the trainer will then return the best model based on the evaluation loss. I set the max gradient norm to 1 — you can play with this — and I haven't done any warmup; again, it's up to you to add some. By default the trainer we are going to use does some dataset preparation, but with visual language models we don't want that because we will do it ourselves, so in the dataset kwargs we pass skip_prepare_dataset=True as a dictionary, and we pass remove_unused_columns=False — that's just a trainer thing. Max sequence length is the maximum length of text the model will produce; I set it to 228, but it's up to you, you can change this too. The number of steps is calculated as the length of the dataset divided by the batch size, multiplied by the number of epochs. I know the dataset has 283 data points in it, so I just hard-coded that here; it gets divided by the batch size and multiplied by the epochs. I find it good practice to know how many steps your training will take; depending on the number of points in your dataset, you can calculate it by changing this 283. I also print the number of steps at the end. Let's run the cell: our hyperparameter definitions are complete, we are using cuda, and the total number of steps is 283 — since we have a batch size of one and one epoch, it is just the number of data points.
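A sketch of that hyperparameter cell, assuming the variable names reused later in this walkthrough and the 283-example 1% train split (the max sequence length of 228 is the value stated in the video):

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
EPOCHS = 1                       # bump this up if you have a stronger GPU
BATCH_SIZE = 1
GRADIENT_CHECKPOINTING = True    # trade compute for GPU memory
USE_REENTRANT = False            # silences a gradient-checkpointing warning
OPTIM = "paged_adamw_32bit"
LEARNING_RATE = 2e-5
LOGGING_STEPS = 50
EVAL_STEPS = 50
SAVE_STEPS = 50
EVAL_STRATEGY = "steps"
SAVE_STRATEGY = "steps"
METRIC_FOR_BEST_MODEL = "eval_loss"
LOAD_BEST_MODEL_AT_END = True
MAX_GRAD_NORM = 1
WARMUP_STEPS = 0
DATASET_KWARGS = {"skip_prepare_dataset": True}  # we format the data ourselves
REMOVE_UNUSED_COLUMNS = False
MAX_SEQ_LEN = 228                # adjust to taste

NUM_STEPS = (283 // BATCH_SIZE) * EPOCHS  # 283 = length of the 1% train split; change if you use more data
print(f"Total number of steps: {NUM_STEPS}")
```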
Our data will need a format that is compatible with visual language models and the SFT trainer, so for every data point in our dataset we will reshape it into the format we want. For this, first we define a system message. What I say is: you are a highly advanced vision language model, specialized in analyzing, describing, and interpreting visual data. I wrote this because we will be working with analyzing, describing, and interpreting visual data; your task is to... and so on, leveraging... and so on — that's all. Then we have a function called format_data, which formats each data point the way we want. In this case I want the dataset entries to be a list of dictionaries. One dictionary has the role "system", and its content has a type of "text" with the system message as the content. The second dictionary contains the user data: the role is "user", and the content is, first, the image fed by the user — so the type is "image", and the image is the image field of the sample we pass in (we will look at what this sample looks like; for now, just assume this "image" key gives us the actual image) — and of course the text, which is the query provided by the user, so its type is "text" and it is the query field of the sample. Last, we need the answer from our assistant: the role is "assistant", the content has a type of "text", and the text is the sample's label at index zero. Like I said, we will go over this function a little more, but for now let's just leave it like this and run it.
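A sketch of the system message and format_data function in the chat-style structure described above (the system prompt wording is abridged in the video, so treat this one as a placeholder):

```python
SYSTEM_MESSAGE = (
    "You are a highly advanced vision language model, specialized in analyzing, "
    "describing, and interpreting visual data."
)

def format_data(sample):
    # Turn one ChartQA row into a list of chat messages: system, user (image + query), assistant (answer)
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": sample["image"]},
                {"type": "text", "text": sample["query"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["label"][0]}],
        },
    ]
```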
Now let's load our dataset. To do this we just call load_dataset, and the dataset we are going to use is ChartQA, a question-answering dataset based on visual data such as charts. As splits we have train, validation, and test. Again, I have a GPU bottleneck, so I will use only 1% of each split; you can do this by asking for the first 1% of each split in the split strings. Now let's load the dataset and start to see what is actually inside.
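A sketch of that loading cell, assuming the HuggingFaceM4/ChartQA repo id and its train/val/test split names (check the dataset card if yours differ):

```python
# Load only the first 1% of each split to keep the run small on limited hardware
train_dataset, eval_dataset, test_dataset = load_dataset(
    "HuggingFaceM4/ChartQA",
    split=["train[:1%]", "val[:1%]", "test[:1%]"],
)
```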
The dataset is loaded. I like to begin by printing the length of the dataset, so let's see how many data points we have: it is 283. If you remember, I already put this number in the hyperparameter cell to calculate our number of steps; this is where the 283 comes from — the length of your data. Next, let's see what the dataset actually looks like. It has four columns: image, query, label, and another one, human_or_machine, which we won't use. Let's also take a look at the test dataset and print it: we have 25 data points with the same columns. Let's also print one of the rows — take the zeroth data point, which we can get by printing index zero of the dataset. We have the image column, which contains the image file; we have the query, which is the input that comes from the user ("How many food items are shown in the bar graph?"); we have the label, the answer column, which in this case is 14; and we have the human_or_machine column, which we won't use. Let's actually look at the image: we can do that by taking the zeroth entry of the test dataset and accessing its image. This is how the image looks — a long-term price index of food commodities. Like I said, we will be working with data in image form, and the question is how many food items are shown in the bar graph: 1, 2, 3, ... 14, so the answer, which is the label, is 14. Before moving on, let's make these prints more organized: I will print this information right after we load the dataset so it is all visible in one place — the length of the train dataset, the train dataset itself so we can see how it looks, and a sample point (index zero). For now that's all; let's just rerun this. As you can see, everything shows up nicely: the number of data points in train, how the dataset looks, and a sample from the dataset. Now let's remove those extra cells, since we more or less understand what the data looks like, and start to format it.
To format the data, we will use the format_data function we defined above. Let's see what a sample will look like: we will pass every single data point through this function. To do that, we redefine our train dataset and apply format_data to a sample for every sample in the train dataset — a list comprehension that puts each data point through the function. Let's come back to the function itself: the sample argument is a data point in exactly the format we just inspected. For the image, that's why we take the sample and access its "image" key — it gives us the image file, so we access it with the image key of the sample. Same with the query: the key is "query" and the value is the actual query. And the response of the assistant, the actual answer, is also inside this data point; remember the label was the answer, so we access it with the "label" key, and since it is a list of only one value we take index zero. This is how we format our data. This is important because, remember what we said in the beginning, we will format and prepare our data for the trainer ourselves — we passed skip_prepare_dataset=True — so we have to do it ourselves. We do the same for the evaluation and test sets as well; let's move those lines into this cell, because I think it is more organized like this, and run it. Okay, now we have train, evaluation, and test datasets with the formatting applied, so let's check how they look. Let's check the length again; we expect it to be the same — yes, the length is still 283. Now let's see what is inside each data point: again we print the zeroth entry of the train dataset, and now each data point looks like the format we decided on. As you can see, it is a list of dictionaries — role "system" with our system prompt ("You are a highly advanced vision language model..."), then our image file, then our query, and then our answer. So we can say our dataset preparation worked well. Let's also look at the test set, because we will test most things on the test set: print its length, 25, and access a sample. As you can see, again: "You are a highly advanced vision language model...", this is our image, this is our query ("How many food items..."), and this is the answer, 14. Again, let's put these prints into the same cell so it looks more organized: print the length of the train dataset and a sample from it (the zeroth data point). What I'm trying to do is compare these datasets before and after the formatting — for the train dataset I take index zero (of course I don't print the whole dataset, because it is a list and too long, so I just take one sample), and I also look at the test dataset's length and a sample data point, the zeroth one. Let's run this again. Okay, so this is how it looks before and this is how it looks after the formatting. We can delete these cells.
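A sketch of the formatting cell — a simple list comprehension over each split, using the format_data function from above:

```python
# Re-map every split into the chat-message format expected by our collator
train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset  = [format_data(sample) for sample in eval_dataset]
test_dataset  = [format_data(sample) for sample in test_dataset]

print(len(train_dataset))   # still 283
print(train_dataset[0])     # now a list of system/user/assistant messages
```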
There are a couple more things I want to do with the dataset before moving on. The first is to take a sample so we can test the model before training. I'm going to create a variable called sample_data and just take the zeroth entry of the test dataset — the one we looked at. I also want to take the question, i.e. the query in it, which is "How many food items are shown in the bar graph?". So I create a variable called sample_question from the test dataset's zeroth entry. If you look at the formatted entry, this question lives in the message at index one (the user message), under the key "content"; that content is a list, and we want its first element, because the zeroth element is the graph itself. So I index with one, give the key "content", index one again, and take the key "text". This should give us "How many food items are shown in the bar graph?". Let's test it, run this, print sample_question — and yes, it gives us the question, so it is correct; we can remove that print. Now I want the sample answer, which is 14. I follow something similar: sample_answer starts from the test dataset's zeroth entry. This is a bit crammed up, but it's in the same format, so we can come back and look: the answer is in the message at index two (the assistant message), under the key "content"; it only has one element anyway, so we take the zeroth element and the key "text". Let's run this and check it by printing sample_answer — it is 14, so that's also fine. Next I want the image itself. Again we follow something similar: sample_image starts from the test dataset's zeroth entry, and the image is in the message at index one under the key "content"; that's a list, so we take the zeroth element, which is a dictionary, and the value sits under the key "image". Let's run the cell and show the image — and yes, we accessed the image correctly, because this is our graph. Now let's put everything in one cell: print sample_question and sample_answer, and show the image, and run it. As you can see, the question is "How many food items are shown in the bar graph?", the answer is 14, and this is our graph. This is the actual problem we are solving: we want to ask a question such as "How many food items are shown in the bar graph?" and teach the model that the answer is 14. Now we are more or less done with the dataset preparation.
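A sketch of those three look-ups against a formatted test entry (indices follow the system/user/assistant order built by format_data):

```python
sample_data = test_dataset[0]

# User message is at index 1; inside its content, index 0 is the image, index 1 is the text query
sample_question = sample_data[1]["content"][1]["text"]
sample_image    = sample_data[1]["content"][0]["image"]

# Assistant message is at index 2; its content holds the single ground-truth answer
sample_answer   = sample_data[2]["content"][0]["text"]

print(sample_question)  # "How many food items are shown in the bar graph?"
print(sample_answer)    # "14"
sample_image            # display the chart in a notebook cell
```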
We can move on to loading the model. Sometimes when I'm debugging I switch to CPU and later switch back to GPU, and I don't want to change my code every time, so when loading the model I usually use an if/else statement to make the code more robust. In my case I just check whether the device is cuda, meaning we are on a GPU. First we define our quantization: for this we use BitsAndBytesConfig, and we pass parameters such as loading in 4-bit, using double quantization, and so on. I won't go into the details of these parameters — it's out of scope — but just know that this is a pretty good configuration for quantization; you can just use it. Then we load the model: we call Qwen2VLForConditionalGeneration.from_pretrained with our model ID, which we already defined; the device map, because we want devices to be mapped automatically; and the quantization config, which is the BitsAndBytesConfig we just defined. use_cache is True by default — it controls whether to use the KV cache, storing keys and values — but it uses additional GPU memory, and as mentioned the GPU is a bottleneck here and training is already resource-consuming, so to avoid CUDA out-of-memory errors I set it to False. That's all for model loading on GPU. If we are not on cuda — most probably we are on CPU — we do something similar but without the quantization configuration: we just pass the model ID and use_cache=False. After the model loading is complete, we have to load our processor. To do this we simply call Qwen2VLProcessor.from_pretrained, again passing the model ID, and we set the tokenizer padding side to right; for autoregressive models it is good practice to pad on the right. We can just run the cell, and it will load our model as well as our processor.
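A sketch of that loading cell; the exact BitsAndBytesConfig values are my assumption of a typical 4-bit NF4 setup, since the video skips over them:

```python
if device == "cuda":
    # 4-bit quantization so the 7B model fits in a small GPU
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        device_map="auto",
        quantization_config=bnb_config,
        use_cache=False,  # KV cache off during training to save memory
    )
else:
    # CPU fallback for debugging: no quantization config
    model = Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, use_cache=False)

processor = Qwen2VLProcessor.from_pretrained(MODEL_ID)
processor.tokenizer.padding_side = "right"  # right padding for autoregressive training
```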
Now that the model has been loaded, let's do a sample text generation — a sample inference — and see how this model actually generates text. For this we will create a function called text_generator and pass sample data to it. The sample data is what we defined above: just a point from the dataset, in this case the zeroth index of the test dataset, which I named sample_data. We will write this function so that it generates text. First things first, let's remind ourselves what the sample data looks like. As in any text generation, the thing we don't want to include is the answer, because this 14 — the answer — is exactly what we don't want our model to see; it's what it should figure out. And this 14 sits in the message at index two: its content is 14, the answer, so we don't want to include it in our input. With this in mind, let's start (and remove those two scratch cells). First we take our processor and call apply_chat_template. This applies the chat template to our sample data and is a very handy function; however, if we pass sample_data directly, it will also include the answer — remember we just discussed this — so we leave out the message at index two. We can do that easily by slicing 0 to 2, which keeps the answer out of our input text. I don't want it to tokenize, so I pass tokenize=False, and I want to add the generation prompt, so I pass add_generation_prompt=True. Let's see what our prompt looks like now: I print the prompt — the text produced by applying the chat template — and call text_generator on sample_data. As you can see, the chat template has been applied. This is our prompt: the system part is the system prompt, "You are a highly advanced vision language model...", which we set at the beginning, and the user message is "How many food items are shown in the bar graph?". The assistant part we do not include; we expect the model to generate it. So let's get the model to generate it. First, obviously, we need the image, so we create image_inputs. Where is it located? It's the same access pattern we used for sample_image: the message at index one, the key "content", element zero, the key "image" — so we just do the same thing on sample_data. To be sure, we print it, and as you can see it is our image file; then remove the print. To prepare these inputs for the model we have to use the processor: inputs, which will be the input to the model, equals the processor called with the text, the image inputs, and return_tensors="pt" so it returns PyTorch tensors. Since we are working on a GPU or CPU, we move the inputs to whatever device we are on with .to(device). Now our inputs are ready and it's time to generate text. For this we just call model.generate, unpacking the inputs, and set how many tokens to generate with max_new_tokens — we already set that as a hyperparameter in the beginning (the max sequence length), so we just pass it. Now the text is generated, but to turn it into something sensible we have to decode it using the processor: output_text comes from processor.batch_decode, with the generated IDs as input, and I skip special tokens so we get an easy-to-read output. To free up some memory, since we now have the generated text, I delete the inputs. I also return the actual answer so we can compare it with the generated output: if you go up, the actual answer is at the location we found earlier, so we just index sample_data at that location and return them both. output_text is a list of one element, so we take element zero (the only element) and return it along with the actual answer — so we return two things, the generated text and the actual answer. I then print both: "generated text is ...", "actual answer is ...". Let's compare: the generated answer is "The bar graph shows the long-term price index for 11 different food commodities", then it lists 12 commodities and says there are 12 food items shown in the bar graph — while the actual answer is supposed to be 14. So, as we can observe, the answer is both inconsistent and incorrect. Our hope with training is to correct that and make the model capable of understanding data from images.
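A sketch of the text_generator function as walked through above; the memory clean-up and index locations follow the earlier look-ups, and note the decoded string still echoes the prompt unless you trim the input tokens:

```python
def text_generator(sample_data):
    # Build the prompt from system + user messages only (slice 0:2), leaving the answer out
    text = processor.apply_chat_template(
        sample_data[0:2], tokenize=False, add_generation_prompt=True
    )
    print(f"Prompt:\n{text}")

    # The image lives in the user message, same location we used for sample_image
    image_inputs = sample_data[1]["content"][0]["image"]

    # The processor turns text + image into model-ready tensors
    inputs = processor(
        text=[text], images=[image_inputs], return_tensors="pt"
    ).to(device)

    generated_ids = model.generate(**inputs, max_new_tokens=MAX_SEQ_LEN)

    # Decode to readable text; skip special tokens for a cleaner output
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)

    del inputs  # free a bit of GPU memory

    actual_answer = sample_data[2]["content"][0]["text"]
    return output_text[0], actual_answer

generated_text, actual_answer = text_generator(sample_data)
print(f"Generated text: {generated_text}")
print(f"Actual answer: {actual_answer}")
```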
Now we are slowly moving toward the training part. Before that, first we have to configure our LoRA adapters. To do that, we create a variable called peft_config, which is a LoraConfig with lora_alpha 16, dropout 0.1, r of 8, and so on; again, I won't go into the details of these parameters, but this is just a good configuration. What we expect after putting the adapters into the model is that the number of parameters will slightly increase. Also, since we will only train the adapters and freeze the rest of the model, we expect the trainable parameters to be a small number compared to the total parameter count. To make that comparison, first I print the number of parameters before adding the LoRA adapters, which we can get with model.num_parameters(); after that we apply the PEFT config to our model with get_peft_model, which converts it into the LoRA adapter model — here called peft_model — and we print its trainable parameters with print_trainable_parameters. Let's run the cell: this is the number of parameters our model has before the adapters, and this is after the adapters. As you can see, there is a slight increase in the count, so the adapters are being mounted, and the trainable parameters are very low, since we are only training the adapters.
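A sketch of that cell; the target_modules list is my assumption, since the video only states the alpha, dropout, and rank values:

```python
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],  # assumption: adapt the attention projections only
    task_type="CAUSAL_LM",
)

print(f"Parameters before LoRA: {model.num_parameters():,}")
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # trainable params are a tiny fraction of the total
```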
Now that we have the LoRA configuration, let's move on and set our training arguments. For this we say training_args equals SFTConfig, into which we put our settings. The output directory is where your checkpoints and models will be stored; I just call it "output". Since we defined everything at the beginning in the hyperparameter section, this is very straightforward: the number of training epochs is the epochs we set; same for the per-device train batch size, which is our batch size; the evaluation batch size, again our batch size; gradient checkpointing, which we decided to set to True in the hyperparameter section, so we keep it; the learning rate; the logging steps; the evaluation steps; the evaluation strategy — we already explained what these do in the hyperparameter section, so I won't go over them again; the save strategy, save steps, metric for best model, load best model at end, max gradient norm, warmup steps, dataset kwargs, max sequence length, remove unused columns, and the optimizer. That's all — like I said, we already decided all of them and explained what they do in the hyperparameter section, so I won't go over them again; since we set everything up front, this part is pretty straightforward. Let's run our configuration.
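A sketch of that SFTConfig cell wired up to the hyperparameter names defined earlier (parameter names follow recent trl releases; older versions spell eval_strategy as evaluation_strategy):

```python
training_args = SFTConfig(
    output_dir="output",                       # where checkpoints and the final model land
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_checkpointing=GRADIENT_CHECKPOINTING,
    gradient_checkpointing_kwargs={"use_reentrant": USE_REENTRANT},
    learning_rate=LEARNING_RATE,
    logging_steps=LOGGING_STEPS,
    eval_steps=EVAL_STEPS,
    eval_strategy=EVAL_STRATEGY,
    save_strategy=SAVE_STRATEGY,
    save_steps=SAVE_STEPS,
    metric_for_best_model=METRIC_FOR_BEST_MODEL,
    load_best_model_at_end=LOAD_BEST_MODEL_AT_END,
    max_grad_norm=MAX_GRAD_NORM,
    warmup_steps=WARMUP_STEPS,
    dataset_kwargs=DATASET_KWARGS,             # skip_prepare_dataset=True, we format ourselves
    max_seq_length=MAX_SEQ_LEN,
    remove_unused_columns=REMOVE_UNUSED_COLUMNS,
    optim=OPTIM,
)
```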
Another thing we need to do: since we set skip_prepare_dataset in the dataset kwargs, we will do the data preparation ourselves, so we need a function that actually does it, and we will pass that function to the trainer. Since we are using the SFT trainer, everything has to be in the format it expects — that's the drawback: it makes everything easier, but it doesn't give you much flexibility and you can't really see what's going on under the hood. In the end this function must return a dictionary with input IDs, attention mask, labels, and pixel values. Let's start by creating some sample data to pass to the function: call it collate_sample and just put two data points in it, the train dataset's index zero and index one. Here I am pretending we are using a batch size of two — even though we are actually using batch size one — so that everyone can understand and reuse it. Now I define our function, collate_fn. It is pretty similar to what we did previously: we apply the chat template to every example, just like we did for text generation, but we do it for all of the examples, so we use a list comprehension (for example in examples). Compared to text generation, I did not take only the first two messages, because since this is training we also want to give the actual answers to the model. Then image_inputs: this is how we accessed our image in the text generation part, again as a list comprehension over the examples. Now we start preparing batches: again we call our processor, passing the texts as the text, the image inputs as the images, and asking for tensors back; additionally, since there can be batches, we pass padding=True. Up to now it is more or less the same as what we did for text generation. Additionally, we want to give the trainer the data in the format it wants, which is why we create labels: we clone the input IDs from our batch, and for every pad token we set the corresponding label to -100 so it is ignored in the loss (you can choose your own masking); this part just handles the masking. Then we create a key called "labels", assign these masked IDs to it, and return the batch. Now, to test the function, we call collate_fn on our sample data, call the result collated_data, and print its keys. As I mentioned at the beginning, it should have input_ids, attention_mask, pixel_values, and labels. The labels we added manually, but the rest have to be there, and as we can see we have input_ids, attention_mask, and pixel_values — plus an additional key, but that doesn't matter — and we also have labels. So the collate_fn function is good to go.
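A sketch of collate_fn along the lines described; masking only the pad tokens to -100 follows the video, though Qwen2-VL recipes often also mask the image placeholder token IDs:

```python
def collate_fn(examples):
    # Apply the chat template to every example, keeping the assistant answer (this is training)
    texts = [
        processor.apply_chat_template(example, tokenize=False) for example in examples
    ]
    # Grab the image from each example's user message, same location as before
    image_inputs = [example[1]["content"][0]["image"] for example in examples]

    batch = processor(
        text=texts, images=image_inputs, return_tensors="pt", padding=True
    )

    # Labels are the input IDs with pad tokens masked out of the loss (-100 = ignore index)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch

# Quick sanity check with a pretend batch of two
collate_sample = [train_dataset[0], train_dataset[1]]
collated_data = collate_fn(collate_sample)
print(collated_data.keys())  # expect input_ids, attention_mask, pixel_values, labels (+ extras)
```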
Now we create the trainer that will do the training. For this we say trainer equals SFTTrainer — again a straightforward process. We pass our model; notice that we pass the regular model, not the peft_model we created above, because the trainer will apply the adapter itself. We pass the training arguments we defined above, our training dataset as well as the evaluation dataset, our data collator, which we just defined as collate_fn, our peft_config, which is the LoRA configuration, and for the processing class we pass the tokenizer of our processor. That's it — let's run this.
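A sketch of that trainer cell (recent trl versions take processing_class; older ones call the same argument tokenizer):

```python
trainer = SFTTrainer(
    model=model,                     # base model; the trainer mounts the LoRA adapter itself
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)
```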
Now it is pretty simple to start the training. However, to get an initial evaluation of the model without any training, I add one more step: an initial evaluation, done by calling trainer.evaluate(). This returns the evaluation metrics for the model before any training, and I print them to compare against every step that gets logged afterwards. After that we start the training: I just print "Training" and call trainer.train(), which starts the run. Let's actually start it — it will take some time, depending on your GPU, even though we are doing a small training run. After this is complete, see you right here.
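A sketch of that cell — an untrained baseline evaluation followed by the training run:

```python
# Baseline metrics before any fine-tuning, for comparison with the logged steps
print("Initial evaluation:")
initial_metrics = trainer.evaluate()
print(initial_metrics)

print("Training:")
trainer.train()
```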
Now our training is complete, and you should have a table like this. Let's go over it. We mainly want to look at the evaluation loss: initially our evaluation loss was 13.66, then at the 50th step it became 13.16, then 12, then 10, then 8.6, then 8. So we can confirm that the training worked and the model improved; but since it was a very small run on a small dataset, the model won't change drastically. If you want to create a more powerful model, I would suggest a full-dataset training; since I don't have that GPU, I did this run just to show how it is done. Now that this is done, we can save our model — our best model — like this: trainer.save_model, saving to the output directory from the training arguments, which is just "output". Since we set load best model at end, this directly saves the best model; in this case it is the model from the last step, but sometimes what can happen is that at one point you have the best model and with more training your loss starts to go up or something similar — in those cases this comes in handy. We can just save our model like this.
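The save step is one line; because load_best_model_at_end is set, it writes the best checkpoint by eval loss:

```python
trainer.save_model(training_args.output_dir)  # saves the LoRA adapter + config to "output"
```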
Now let's do some inference and test our new model. But first we want to clear out some memory. For this we use a code snippet I took from a VLM fine-tuning blog post, which I will also put in the references and suggest you check out too. It is a good script for clearing out memory, so we just run it, and now we have freed up some GPU memory from the training. To do inference, of course, we have to load our model again. To do that you can just go up, copy that model-loading cell, and paste it again, because we are using the same thing; however, since we are no longer training but doing inference, we can use the cache, so I set use_cache to True in both branches. Let's load the base model again.
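The exact clean-up script comes from the external blog referenced in the video; as a stand-in, a common memory clean-up pattern looks like this (not the referenced script):

```python
import gc

# Drop references to the big training objects, then release cached GPU memory
del model, trainer
gc.collect()
torch.cuda.empty_cache()
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```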
Now the model loading is complete. If you noticed, at this point we loaded the regular model again, so in the next step, since we did a LoRA training with adapters, we want to mount our adapters on top of the base model. How we do that is by saying model.load_adapter with "output" — this is where we saved our model as well as the adapters — so when we call model.load_adapter it loads the adapter, and hence we have our latest trained model. Let's also do our regular check of the number of parameters and the LoRA adapters: print the number of parameters the model has before the adapters, and also check the number of parameters after we have loaded the adapters. As you can see, the parameter count of the model has increased, because we loaded the LoRA adapter.
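A sketch of the adapter mount plus the parameter check (load_adapter comes from the PEFT integration in transformers):

```python
print(f"Parameters before loading the adapter: {model.num_parameters():,}")

# Mount the LoRA weights we trained; "output" is the save directory used above
model.load_adapter("output")

print(f"Parameters after loading the adapter:  {model.num_parameters():,}")
```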
Now let's actually do a sample text generation. We already have a function above, and we tried generating text before training, so let's go there and copy those three lines — we already have the text_generator function and we already have the sample data. Let's print it, run it, and let our new model generate some text.
So let's look at the generated answer: here is our question, and the assistant says "There are 11 food items shown in the bar graph." The actual answer is 14, so it got it wrong again. However, if we compare with the previous answer — let me just copy it and paste it here — at least it is more consistent: it didn't list 12 commodities while claiming there are 11 different food commodities. So there is some improvement. Of course it is just not there yet, which is pretty much expected with such a small training run, but if you continue training on the full dataset instead of 1% of the data, it will get there. So this is how you train a visual language model — in this case we fine-tuned the Qwen2-VL-7B vision language model. Thank you for watching, I hope you learned something new and enjoyed doing this experiment. Don't forget to like the video, subscribe to my channel, comment what you want to see next, and until next time — bye-bye.