Small Language Models (SLMs) Are the Future: Fine-Tuning AI That Runs on Your iPhone
By Daniel Bourke
Summary
Topics Covered
- Small LLMs Run Natively on iPhone
- On-Device Models Slash API Costs Forever
- Quantize to 4-Bits for iPhone Deployment
- Fine-Tuning Removes Bloat and Shortens Prompts
- Live Demo: 100s Fine-Tune Beats Base Model
Full Transcript
Hello everyone. One second, sorry. I'm just going to record this so that I can put it on my YouTube channel later.
Now, let's get started. So, we're doing a small LLMs talk. Now,
that kind of sounds a bit weird, because I mean, small large language models? I guess I don't actually have a specific definition for what a small language model is. We can kind of define it tonight as a crowd, because this space, as we all know, is rapidly evolving. If you'd asked me maybe a year ago, my definition would have been under a billion parameters. Who knows what a parameter is, if I say that? I just want to gauge. Okay, sweet, don't have to explain that. Now, I would define a small LM these days as a model that can run natively on your own computer or your iPhone. So that's the definition we're going to run with. And my dream at the moment is to shill, or expose, those models as much as possible, because I love things that run on your own computer.
So tonight we're going to do this talk kind of like a cooking show. It's going to be a little bit here and there, because I'll be honest, this is really exploratory at the moment. A lot of the stuff that I've worked on here has only really been possible in the last couple of months, due to framework maturity and model releases.
So, on tonight's menu: we're going to look at some real-life examples that I've worked on and various case studies. We're going to define what a small language model is, which we've kind of already done. We'll look at a custom dataset and how we make one of those. And then we'll do a live fine-tuning session of a small language model, and compare the results of a base model to a fine-tuned small model. This is going to be like a cooking show, of course. And finally, we'll finish with a haiku.
My goal for this evening is for all of you to go and ask yourself the question, in your business or your life: can we create or use our own small models, whatever your use case might be?
Now, a little bit about me. I'm very lucky, I came second in a jiu-jitsu competition on the weekend, so if anyone wants to roll sometime, please let me know. I teach machine learning, and just crossed 250,000 students in 195 different countries, which is kind of insane. I have a 4.95 Uber star rating. My PyTorch video is the most watched on YouTube, which, I mean, it's a 25-hour tutorial; I never thought it would somehow amass 5 million views. My brother Josh, he's sitting over there, and I built an app called Neutrify; my formal degree is in food science and nutrition, so that's marrying up machine learning with that. I partner with businesses and build custom data pipelines, models, you name it. My most recent one was with a conservation company in America, detecting 900 different types of bird species with small computer vision models; also models that had to run on device in offline environments. And I think the Kendrick song that describes me the most is HUMBLE. I would describe myself as 80% of Bryan Johnson, I do love health. And I'm kind of like Steve Irwin, except instead of crocs, AI and ML are my crocs. So that's me. If you want to learn how to optimize your PyTorch models, I have a great video that turns it into a song. I won't sing it for you tonight.
Okay, so let's do case study number one. This is something that my brother and I cooked up earlier, like all good cooking shows. It's for a Kaggle competition that we entered last month: the MedGemma Impact Challenge. And so, MedGemma: has anyone heard of Gemini? Gemma is like Google's open-source version of Gemini, smaller versions of Gemini. MedGemma is medical-domain fine-tuned Gemma. So I'll just play the video and then we can discuss, and it really ties into small language models.
Good day. Welcome to Australia, more specifically the beautiful Burleigh Beach on Australia's Gold Coast. Now, Australia is known for its beaches, known for its sun. However, that sun comes at a cost. In 2023 to 2024, Australia spent $2.5 billion treating skin cancer and other skin cancer related diseases. And tragically, around 2,000 people this year will die from melanoma and other skin cancers. Now, the good news is that if it's caught early, it can be treated early. And treating skin cancer early is not only more cost effective, it results in better survival rates. And that's where Sunny comes in. Sunny is an iOS application, powered by a fine-tuned version of Google's MedGemma we call Sunny MedGemma, that encourages people to do structured self-skin examinations. Research shows that 44% of melanomas are discovered by patients themselves or their partner, and Sunny is designed to build on this momentum. It fills the gap between the excellent sun damage prevention marketing Australia does and the lack of a national screening program. Sunny's goal is to double the reported number of Australians who do a yearly self-skin examination, from 26% to over 50%, and to increase the number of skin cancer treatments performed at earlier stages 0 and 1, compared to stages 3 and 4, by 20%. I went for a skin check the other day and the doctor found a lesion on my toe I didn't know about. Luckily, they said it's nothing to worry about. They even mentioned I could take a photo of it with my phone and revisit it in a year or so. That's exactly what Sunny is for: the take-a-photo-of-it-and-revisit-it-in-a-year workflow. Of course, when dealing with skin photos, privacy is paramount. Sunny runs entirely on device and is locked behind biometric or passcode authentication, similar to how your banking app would function. When you take a photo with Sunny, it uses the Sunny MedGemma model to generate a structured response, similar to what a dermatologist would note down. Your photos never leave your device, for inference or for storage. And when it comes time, Sunny makes it easy to export these scans into a report that can be shared with your doctor. It creates a repeatable habit for self-skin examinations. Importantly, Sunny is not a diagnostic tool, but a tracking tool.
Getting a bit of insight into my YouTube recommendations there. Okay, so, Sunny Q&A. We just submitted that about two weeks ago. That's my brother Josh, he's an iOS engineer. That model runs directly on the iPhone. Now, why would it be important in a use case like that for the model to run on the device, rather than sending any data to an API?
>> Privacy.
>> Privacy, yeah. Anything else? That's probably the biggest one.
>> Without an internet connection. So, remote third-party countries and so on.
>> Exactly, yeah. And latency as well. Yeah, that's a good one. So privacy is probably the number one, and then latency and offline use, and then ease of use, right? You train the model once, you deploy it to the iPhone, and it goes out. If we were to take this further, the Sunny pitch to the government would be: hey, this could be our national screening program.
Of course, it's not production ready, it was an entry to a Kaggle competition, but we can kind of see where that would go, and the importance of privacy in a case like this. That's the big one in the health data space. So that's the idea of MedGemma, and I guess that was the competition's goal: to show use cases of where the MedGemma model could be used.
Cost is the next one. These models take quite a big upfront investment to create, but then the fine-tuning, as we'll see later in the live demo, can happen in minutes. So if we look at the costs: that's a really niche use case, taking skin photos. If we look at Gemini 3 Flash's pricing as of today, if we were to deploy Sunny as a national rollout and, say, within a few weeks had 10 million photos, with the current Gemini pricing it would be about $55,000. Which is not that much in the grand scheme of things, in terms of how much Australia invests into healthcare, but this is just one application across many potential use cases. So when you train your own model, yes, you do have that upfront investment, however, you can now run inference for free, because you deploy it to the iPhone and all the compute happens locally on the device. There's no charge for input tokens and no charge for output tokens. You could run that as many times as you want. And if that was 10 million images, what if that happened every year for the next 20 years? That model could continually run over and over again. So we're going to discuss small language models, or custom models in general, from a hardware perspective.
So how did Sunny work?
Sunny is a vision language model, so it has a vision component and a language model component. Now, current hardware chips on consumer devices, and iPhone hardware in particular, which is where our expertise is, aren't yet completely optimized to run LLMs as fast as possible. They are getting there. But the current best practice is to run the vision component on the NPU on an iPhone, the neural processing unit. Whenever Apple shows that graphic when they release a new iPhone every year and they say "the neural engine": the neural engine is specifically built for tensor operations, and at the moment it's really good for vision-based models. So the vision part of the MedGemma model can run entirely on the neural engine; you can see there's an Xcode analysis there. That's why Sunny takes a little while to warm up, because we haven't optimized the model load-in on the app yet. But once it's warmed up, inference is basically instantaneous, because the image gets processed in milliseconds. The LLM part, because it's autoregressive, token by token, happens on the GPU. And one big constraint of... oh, you go.
>> Just a question on that: is that what Apple recommends, or does it just sort of suck and everyone does it this way?
>> That's what we've deciphered from the research. Apple published a paper, I haven't got it here, called FastVLM, and that's how they did it. There's nowhere this is explicitly stated, but in practice this has been found to work. And I don't think it was designed to be this way; it's more of a "it's funny that it works like that". I think it's because the hardware is built out more for vision models than for LLMs: the NPU is really good at doing a big batch process over the image, whereas the GPU is better at doing the token-by-token calculation, for now. And that's just from our experiment of actually deploying a VLM to run on device and measuring the latency, because that was one big thing: we deployed it at first and it would take 10 seconds to run, so we had to go into the hardware and optimize what it runs on. Of course, my bias is towards consumer hardware. Most of us have probably got an iPhone in here, probably Android as well; it'll be a very similar story on the modern Pixel chips in the Google Pixel, not too sure about Samsung. But that's where my bias is, because not everyone's carrying around an H100 on them. So that's what I'm biased towards: how can we run these models as fast as possible on the hardware most of us already have?
Memory is a big thing when you're running a model on device. That's a photo of my left leg, I believe. Now, Sunny is not a diagnostic tool, it's just an analysis tool. We've trained it on open-source images of dermatology. They weren't that great, if I'm going to be honest with you, from a data quality perspective, but they were available commercially. So if we were to productionize this, we'd first improve the data quality. Josh and I aren't dermatologists, but we are engineers, so we can get this up and running. What Sunny's goal is to do is just to increase that step: okay, maybe 20% more people are doing self-checks once a year, and that leads downstream to more skin cancers getting found at an earlier stage rather than a later one. So, going back to
later stage. Um, so if we go back to hardware, when you have a footprint, uh, the modern iPhone 17, iPhone 17 Pro has 12
GB of RAM. So you got a fair bit of space there. And so um, when you look at
space there. And so um, when you look at a model that's 4B parameters in 4 billion parameters, that is in float 16 that's about 8 GB. And so we've quantized this down to, we'll get to
that in a second, in software, to 4 bits. So we can get it to run with 3.5
bits. So we can get it to run with 3.5 GB of memory which runs comfortably on basically any modern iPhone.
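As a rough sketch of where those numbers come from (weights only; the figures in the talk include extra overhead such as embeddings kept at higher precision and the KV cache, which is why a 4-bit 4B model lands around 3.4-3.5 GB rather than the raw 2 GB of weights):

```python
# Rough memory footprint of a model's weights at different precisions.
def weights_gb(n_params: float, bits: int) -> float:
    """Gigabytes needed to store n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9

N = 4e9  # a 4B-parameter model like Sunny MedGemma
print(f"float32: {weights_gb(N, 32):.1f} GB")  # 16.0 GB
print(f"float16: {weights_gb(N, 16):.1f} GB")  # 8.0 GB
print(f"int4:    {weights_gb(N, 4):.1f} GB")   # 2.0 GB of raw weights
```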
Speaking of software, this is Sunny's pipeline. We have an input dataset, photos of skin synthetically labeled with Gemini. We upload that to Hugging Face Datasets, then we fine-tune the Sunny MedGemma model in a supervised fine-tuning fashion with Hugging Face TRL, which is Transformer Reinforcement Learning. Then we deploy it to the phone using Hugging Face swift-transformers, as well as Apple's MLX, which is basically like PyTorch but for Apple silicon. The MLX ecosystem has exploded over the past year or so. Basically, every time a new open-source model gets uploaded to Hugging Face, you can run it locally within a day or two, if your Mac has enough RAM. That's the only limiting factor at the moment: how much RAM you have determines how big a model you can run. Take the recent Qwen 3.5 models: I think the 4B model outperforms GPT-4o on basically all the benchmarks. Of course, it's probably been benchmark-maxed, but even if you took 20 to 30% off that, it's pretty cool that you can have a GPT-4o running locally on your MacBook now. And my, I don't know, guess or hope is that by the end of this year we're trending towards GPT-5 running live on a MacBook. Not GPT-5 the headliner model from OpenAI, but an open-source variant that can run on your local computer.
This is a podcast with Jeff Dean. Does anyone know who Jeff Dean is? Probably the real nerds. He's one of the leading scientists at Google. This is from his talk on Latent Space recently, and he had a kind of throwaway line that I really liked, because I've seen it myself in practice building the Sunny model, as well as other models: go very low on precision. Has everyone heard of precision in computing, if I say float16, int4, int8, all that sort of jazz? Where the research is kind of going at the moment is: you go large on parameters, but you go hard on precision. I don't have a good reason for why that works, but intuitively it kind of makes sense to me: more parameters give your model more capacity, so even if you do lose some precision, you still get the performance on the other end, because they just have such a sheer number of parameters. And so we're getting into the realm of all the knobs we have to turn to get small models to work well on consumer-level hardware. The one thing is the training; the next thing is, okay, how can we optimize for the hardware so they're not taking 30 seconds to run at a time; and the next thing is, okay, how can we optimize the footprint? Precision is one of the ways you can optimize the footprint.
So, that's what we just mentioned. If we start in float32, which not many people do these days, but that's the default if you create a tensor, that'll be 16 GB for a 4 billion parameter model, and that's just too large. That's where MedGemma would have started if it were in float32. The good news is that by default a lot of model training is done in float16, so it starts off at 8 GB. When we quantized it for Sunny, we took it down to 4 bits, so it was about 3.5 GB when we deployed it to device. I don't know if these are large enough, so excuse me, I'll just read them out. The original one is on the right; that's Google's release of MedGemma 1.5, which came out about two months ago. That's about 9 GB total, and the one we deployed to device was 3.4 GB. The next step we did for deploying it on device was fine-tuning. Now, the reason why
we fine-tuned it is because we wanted it to be able to do the same task with a shorter prompt. Language models, as we all know, are very diverse things; they can do almost anything. But that's kind of a weakness, in a sense: if you want it to do a specific task, you don't need all the bloat of the larger prompt. Because when you're in the small-device regime, every token counts: every one of those tokens, 248 of them, adds to the memory tally. Before, when we looked in Xcode, we had 3.5 GB of memory usage. If we used a larger prompt, we could almost double that memory usage, which would actually cause it to crash on iPhones with less than 8 GB of RAM. So what we did was simply get it to do the same task we wanted, structured data extraction, with a shorter prompt, because then the KV cache (say that fast 10 times) is a lot smaller: you're just getting it to do the task you want. These models are incredibly easy... well, I don't want to say easy, but they have such a large capacity, because they've been trained on basically the entire internet, that when you fine-tune them it's amazing how quickly you get results. We're going to see that later on. The fine-tuning stage for MedGemma was probably about 15 minutes of training; it wasn't that long. The hardest part these days is constructing a dataset and deciding what the specific use case is for your business.
These are some examples of the iterations we went through testing the model. With the base MLX model and a skin-extract prompt, we got excessive disclaimer output. That was a weird artifact of the MedGemma model: remember when these models were sort of overtuned for safety, and they'd always put disclaimers on things? When they first came out, they would just say, hey, I'm not a doctor, or whatnot. So we trained that out of the model. Google had released the model with the disclaimer output on every response you typed into it: you could just pass it an X-ray and it would always output "disclaimer: I'm not a medical professional". We don't need that in the small-device regime. As I said, every token counts; you don't want it inputting or outputting excess tokens. Then we tried the long extraction prompt: hey, what if we just give it instructions, how does that work on device? That was unstable: when you quantize the default model without fine-tuning it, it's far less stable. But when we moved all the weights to do the single task we want, we got exactly what we want: it adheres to the structure, with the shortest output generation time. So the main point here is that we went through several iterations, from base model, to fine-tuned model, to quantized model, to a model that runs in an app on your phone.
These are some more case studies. We won't spend as long here, because I don't have as much hands-on experience with them. GLiNER2, if you're doing named entity recognition, is a model that can run on CPU, runs at scale, and outputs structured data, so you don't have to send any of your private data to an API. QED Nano is a 4 billion parameter model that is on par with Gemini 3 Pro for theorem proving. All the math problems it actually solved are far too complicated for me, so I personally haven't looked at everything it can do, but it goes to show that if you have a small model and a specific use case, you can train it to be competitive with the state of the art. SAM 3 is more in the vision space, but it now has a language component, so you can just type in, "hey, person in a blue top with black shorts". That's me training for the jiu-jitsu competition the other day; it basically perfectly segmented me. For any vision task these days, for example the bird annotation one, I would always start with a large run of automatic labeling, say from a SAM 3 model, and then train specific models to replicate that. So if you're in the vision space, a lot of the perks we've got from LLMs are now crisscrossing back into vision. A lot of document intelligence is getting far, far better, because we have far better automatable data pipelines, and the humans come in and, instead of annotating everything bit by bit, just correct them over time. That's where a lot of my work as a machine learning engineer goes: not annotating every single sample, but running large-scale automatic annotations, reviewing them, updating the pipeline, and optimizing the model for downstream use, whether that's deploying to device or optimizing for the hardware it's going to run on. Google AI Edge Gallery is an example of an Android app running Gemma 3n, which is a model that's been optimized to run on mobile phones; we're going to get to Gemma 3 in a second. It just runs offline, and there's an iOS version now as well. Basically like Gemini, but it runs on your phone.
Obviously at a much smaller scale. And then, yeah, Qwen 3.5 4B, we just mentioned this. That's one of the smaller variants of the Qwen 3.5 series, but it excels on basically every public benchmark that GPT-4o, when it came out two years ago, was measured on. And of course, it's probably been benchmark-maxed, and what I mean by that is it's been overtrained on the benchmarks to get good results. But even if you took 30% off, it still beats GPT-4o. And if you've tried it hands-on, it's a pretty good model. So I guess that's where we're at: we're like 18 months to two years from state of the art to a model you can run on your phone. And that's the sort of environment that is my specialty: on-device models.
This is another example of a VLM edge use case. I don't know if anyone's ever gone through this, but it's quite a hard process; my brothers are over there. Our father just moved into an aged care facility. I would say there are a lot of use cases there where more intelligent monitoring systems could be used, because there is fall detection in the facility, however, it's just kind of a mat that sits underneath the bed. Now, of course, this gets into the realm of privacy, but that's where smaller models that can run locally come in: no data would ever leave the building.
This is Reachy Mini, by the way, a robotic system you can purchase online. No affiliation, I just think it's cool. But you could have systems like this, running smaller on-the-edge models similar to the Sunny MedGemma model, that monitor a room. For example, I was there the other day, walking past, and someone had fallen over. I'm not a medic; I went to get someone, but I don't know how long they had been on the ground for. If there had been a vision system in that room, and of course you can pair them with traditional detection models, then potentially that fall could have been discovered quicker than when I just happened to walk past after visiting my dad.
So, we've kind of already discussed this, I hope, but a small language model is many things, and I guess we've blended this into small custom models. It's custom. It's compute constrained. It can run offline if you need. It's privacy preserving: I partner with many businesses where basically their data either can't leave the building or can't leave the country, and smaller custom models solve all those privacy problems. It is an upfront investment, whether you're investing in hardware or training, but from there on you get basically infinite inference. Like the MedGemma model we fine-tuned: it could run 10 million times and we won't see an API bill. And of course I'm really biased towards this, but I like owning your own compute.
So, how to choose a model for your use case? If you need privacy: likely a custom model. If you need on-device/offline capabilities: likely a custom model. If you want to get started ASAP: just go to the API. If you need the most powerful model in existence, or you have a really big, hard use case: the biggest API model you can use. If you want to own your compute stack: custom model. And if you don't know, well, you can contact me.
>> To get started with a language model, what hardware or resources would you recommend?
>> Well, it depends, it's kind of a hard question, but we're going to use Google Colab in a second.
>> For fine-tuning?
>> So for fine-tuning, you'll see in a second how quickly we can fine-tune a model using Google Colab, which is a free service. You can pay $10 a month, which I do, and I think it's probably the most valuable compute you can buy at the moment, because the GPUs we're about to see are state-of-the-art. But also, Macs are getting there, not quite there, but if you have a Mac you could just train on that. As for a particular GPU, you'd want something with about 16 GB of VRAM as the minimum for fine-tuning a model, unless you're really patient and can let it run for days on end. So, does that answer the hardware question?
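As a back-of-the-envelope sketch of why roughly 16 GB of VRAM is a comfortable floor for full fine-tuning of small models (the byte counts below are rough rules of thumb that ignore activations and framework overhead, and adapter methods like LoRA need much less):

```python
# Back-of-the-envelope VRAM for full fine-tuning with Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 Adam moments (8 B)
# per parameter.
BYTES_PER_PARAM = 2 + 2 + 8

def full_finetune_gb(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM / 1e9

print(f"Gemma 3 270M: ~{full_finetune_gb(270e6):.1f} GB")  # fits a free Colab GPU
print(f"4B model:     ~{full_finetune_gb(4e9):.1f} GB")    # needs a much bigger card
```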
Fine-tuning versus RAG versus prompting: they're kind of all blends of the same thing; they all have the same goal. You would start with prompting, use fine-tuning for a specific task, use RAG for specific knowledge, and then mix and match them whenever you need. That's my framework for how I view all of these.
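That framework can be summarized as a deliberately simplistic lookup; real projects usually combine the approaches rather than picking exactly one:

```python
# The talk's rule of thumb, encoded directly.
def suggest_approach(need: str) -> str:
    return {
        "getting started": "prompting",
        "specific task": "fine-tuning",
        "specific knowledge": "RAG (retrieval-augmented generation)",
    }.get(need, "mix and match prompting, fine-tuning and RAG")

print(suggest_approach("specific task"))       # fine-tuning
print(suggest_approach("specific knowledge"))  # RAG (retrieval-augmented generation)
```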
Language models are versatile: tokens in, tokens out. That's probably the biggest mind shift. If you've ever been in traditional ML, training a model or whatnot, there used to be a very limited output: you'd have a thousand labels, whereas now the token space is effectively infinite. That's the beautiful thing about language models: you can design your token input and your token output however you want, and generally the models are pretty good at that.
So now we're going to move on to a live LLM fine-tuning session. I haven't really ever done this live, but we'll see how it goes. Here's what we're going to do: we're going to build "Who's Here". We're going to use Gemma 3 270M, which is 270 million parameters. Sunny MedGemma was 4 billion parameters, so this is already roughly 15 times smaller than the model we deployed to the iPhone; that's quite a small model. And if we look at the Gemma 3 variants, they go right up to 27 billion, so this model is 100x smaller than the largest Gemma 3 model.
So let's have a look at what we're going to do. For data, I've pulled everyone's names from the meetup group. This is all public-facing information, by the way: just go to the meetup event and the attendee list. So if your name was on there, you might be in our dataset, but just your name, right? Then we have the model, which is Gemma 3. And then we're going to create a demo that compares the base output of the model you'd download off the shelf to the one we're about to fine-tune.
So, let's go to Google Colab. We'll reconnect. And how are we doing for time? This won't take too long. We're going all right? Yeah. Okay. So, I'll just show you: this is the list. I just went to this list and basically searched whatever name appears here, and if there's public LinkedIn information, that's gone into the data set. So our idea is, let me get up the base model, if we enter someone's name, it should give back some information about them.
So this is the base Gemma 3. And this is probably my favorite one of the whole lot: we're going to use our host's name. So, Michael, I didn't actually tell you about this, but this is kind of funny. If we type in Michael's name, this is the generic Gemma 3 response. Apparently Michael's a highly acclaimed, influential rapper, songwriter, and producer, known for his distinctive style, lyrical prowess, and his ability to create complex and innovative soundscapes.
>> Agreed, it's on my LinkedIn.
>> Yeah, yeah, I agree. Okay, now let's try me.
So I've got a lot of stuff on the internet, so maybe it's been slurped up into Gemma 3. Who knows? "Daniel Bourke is a popular and highly respected figure in the world of cyber security." Okay, I've never done anything to do with cyber security. "He is known for his contributions in the field of cyber security, particularly in areas like threat detection and prevention." Okay, that's not correct for me. Uh, does anyone want to volunteer as tribute?
>> Yeah. What's your name? How do I... can I find it on here?
>> Can you spell that for me, please?
>> G A U R.
>> G A U R. See, this is Meetup's horrible design. Look at this, you have to keep scrolling. I'll type it in here anyway. Okay. G A U R.
>> A N G A.
>> M H E.
>> Mike Alpha... M A G? Yep.
>> R.
>> R I P.
>> L I S. Okay, I'm going to copy that to the clipboard so I don't have to type that again.
Uh, "Ganga is a common and versatile word in Hindi."
Is that right? I don't know.
>> Okay.
Okay. Um, let's save that for later. I'm going to put your name right up the top here so that way we can try that one later.
Okay. So, this is Google Colab. Anyone used Google Colab before? Okay. It's a free service, basically like Google Docs for Jupyter notebooks. If you're a data scientist, you're familiar with Jupyter notebooks; it's a cell-by-cell way to run Python. And the beautiful thing about Google Colab is that Google offers free GPU backends. I'm on the paid service, so I get higher quality GPUs. And as I said, I've used this for a number of years, and at $10 a month, it's worth it for me.
So here's what we're going to do. We've got a recipe. We need to load a data set. We're going to inspect our data set. We're going to fine-tune the model. We're going to evaluate the model. We're going to upload the model to the Hugging Face Hub for reusability, so everyone could download this model if you really wanted to. And then we'll create a demo to compare it to the base model that we just saw. So, we need some dependencies.
And I'll just run through these very quickly while it's all installing. Transformers is a library that gives you access to models. Datasets gives you access to data sets. TRL is Transformers Reinforcement Learning. Accelerate speeds up your code. Hugging Face Hub is like GitHub but for model artifacts. Gradio is going to be our demo application framework. The ROUGE score is an evaluation metric. And then matplotlib is going to help us draw some nice loss curves. Okay. So let's see if our GPU is working here.
Now, this is a great GPU. This is an NVIDIA RTX PRO 6000. We have close to 100 GB of VRAM, which is crazy. And it's also Blackwell edition. But if we search RTX PRO 6000 price in AUD... yeah, that's probably actually cheap; it's probably closer to $25,000. So that's the beauty of Google Colab: you pay $10 a month and you get access to GPUs like this. Um, and then we go through here: imports, getting the libraries we need set up, and getting some information about the GPU to show you how much VRAM we have accessible.
So, we don't actually need this GPU for what we're going to do; it's quite overkill for our task. But if we wanted to do anything more complex than training a model to give us back some information from names, we'd probably want a higher capacity GPU. We could get by with the minimal GPU that Google Colab offers for what we're doing here. I've selected this because I want this to be fast.
Um, and now we're going to download the base model. We put in an ID here from Hugging Face, and as we see, the model starts loading. It's 536 megabytes, so there's definitely plenty of capacity to run it on many of our local devices here. And if we go here, this is the base model that Google have released, Gemma 3. I'm not going to read through all this now, but Hugging Face is basically where I go every morning to see all of these models, which are open source. And sorry for the light mode blinding everyone. Is that incredibly orange? Should I turn off Night Shift?
>> No, it's just terrible.
>> Oh.
>> It's the display.
>> Oh, okay. But all of these models are open source. So, this is where I'm sort of getting to the point: there are a lot of different parts going on here. We have models; we've ticked that off. Now we need some data. How do we get data?
So, let's go back to our Google Colab notebook. Um, let's try... oh, this is going to try the base model, so we're going to see what it outputs. I've got a list of names here. I've thrown my brother in here as tribute. There's Daniel Bourke, that's me; William Bourke, that's my brother; "What are Daniel Bourke's skills?"; Elon Musk; and then a haiku.
So, this is the base model before fine-tuning: input, output. We've seen a couple of examples of that before. "Daniel Bourke is a well-known and respected figure in the world of computer science and technology." Okay, we've got a different response now. So that's an interesting thing about LLMs: they're not always deterministic. "William Bourke is a well-known and influential figure in the field of computer science and technology" as well. Is that true?
>> No.
>> Oh. Um, and then, what are Daniel Bourke's skills? "Highly accomplished figure in the field of artificial intelligence." Wow. Okay. Thank you. Um, Elon Musk. It probably knows about Elon Musk because he's in the training data. And then here's a haiku: "Green leaves softly sway, sunlight paints the forest floor, nature's gentle grace." So, it's okay. Not great. Um,
now we're going to load a data set. We've got our base model; the next ingredient is a data set. Our data set is the way we guide the model: what do we want it to do? And as I've said, with Who's Here, which is our demo, we want to input someone's name, like William Bourke, and get back the information we'd find in a quick Google search. So there's what we just saw before.
Doesn't really work. So how did I make our data set? We have our attendees list; I took these to a CSV. Some people's Meetup username isn't a full name, which is fine; we have ways to handle that in the data. But for people who do have a full name here... where's the favorite name that we just set up before? Is it here? Um, what I've done is I've searched for that name and pulled public LinkedIn information, and then we're trying to get the model to regurgitate your LinkedIn profile, so that if you were at this meetup and you had this app, you could type in someone's name and it would tell you what you should ask them about. Um, so we've searched LinkedIn for names. We've created a synthetic data set.
Synthetic just means I've asked another model to create example question-and-answer pairs based on a certain input.
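A rough stdlib-only sketch of that idea. In the talk another LLM writes the pairs, so the template function below is just a stand-in to show the shape of the data, and the names are hypothetical:

```python
import json

def make_qa_pairs(name: str, summary: str) -> list[dict]:
    """Generate simple question/answer training pairs from one profile.

    In practice you'd ask a capable LLM to write these; templates are
    only a stand-in so the shape of the resulting data is clear.
    """
    questions = [
        f"Who is {name}?",
        f"Tell me about {name}.",
        f"What should I ask {name} about?",
    ]
    return [{"question": q, "answer": summary} for q in questions]

pairs = make_qa_pairs("Jane Doe", "Jane Doe is a hypothetical Brisbane-based engineer.")
print(json.dumps(pairs[0]))
```

Run over every attendee profile, this is how one paragraph of source text fans out into several training samples.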
Um, that's the beauty of language models these days: text data, and we're getting close with vision data too, is effectively infinite. One of the drawbacks of the pre-generative era was that if you didn't have a data set, you were basically screwed, or you had to go through the long iteration phase of creating a custom data set. These days we can create custom data sets in hours instead of months. Um, so some notes: we have plenty of edge cases, but we can handle them through experimentation. Some names don't appear naturally; how do we handle those? These are all questions you'll have to ask in your business use case: how do you handle the edge cases? That's a never-ending question in the world of machine learning. Um, so our data set is hosted publicly on Hugging Face under mrdbourke, which is my username there: Queensland AI Meetup SFT (supervised fine-tune) V2. Why V2? Because I started with a V1 and it was a thousand samples, but it was too small. So I iterated and basically 10xed it, took it to 8K, and now it works pretty well. Spoiler.
So, we're going to inspect some samples. This is what LLM data looks like: it's just tokens in, tokens out, but of course we're looking at it in text form. Now, our use case is very simple, but if you're a large company such as OpenAI, Google, or Anthropic, your workflow for this is going to be basically the same.
I think OpenAI's last public mention was that they have 633 petabytes of data. That's just an astronomical amount. I think the Common Crawl pull was about 14 petabytes, so that's, what, roughly 45x. The Common Crawl is a copy of the internet, so they've got about 45 times the entire internet now. GPT-4 was probably trained on just Common Crawl, but now we're getting into the realm of synthetic data, right? You need all these long agent traces. But it's still the same process, still just long sequences of tokens, and the same training methodology: get your LLM to predict the next token in the sequence. Um, so now we're going to fine-tune with the SFT trainer.
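That next-token objective can be sketched with a toy bigram counter. A real LLM learns a neural network over a huge subword vocabulary; counting which word follows which is just the smallest possible stand-in for the idea:

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count which token follows which: the crudest next-token predictor."""
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(counts: dict, token: str) -> str:
    """Predict the continuation seen most often in training."""
    return counts[token].most_common(1)[0][0]

model = train_bigram("in out in out in stop")
print(predict_next(model, "in"))  # "out": it followed "in" twice, "stop" once
```

Pre-training, fine-tuning, and even the who's-here task all reduce to this same objective at vastly different scales.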
And while it's training, if anyone's got a question they want to ask... but basically, for training, we set up some settings here, and then we have a configuration. We use the SFTTrainer, where SFT stands for supervised fine-tuning.
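For reference, the setup just described might look roughly like this, assuming TRL's `SFTTrainer`/`SFTConfig` API; the dataset variables and settings here are placeholders rather than the exact ones from the talk:

```python
# Sketch only: assumes `train_ds` / `test_ds` are chat-format Hugging Face
# datasets already loaded, and that the TRL library is installed.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="gemma3-whos-here",   # placeholder output path
    num_train_epochs=2,              # the talk trains for two epochs
    per_device_train_batch_size=8,   # illustrative batch size
)
trainer = SFTTrainer(
    model="google/gemma-3-270m-it",  # the small Gemma 3 variant
    args=config,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)
trainer.train()
```

On a big GPU a run over a few thousand short samples like this finishes in a couple of minutes, which is what makes the live demo feasible.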
We pass it our training data set, which is a few thousand samples, and our smaller test data set. Training is going to run, and because we have a nice RTX 6000 Blackwell, it's actually only going to take about two minutes. So, maybe one question.
>> Any questions?
>> How do you know when it's properly good enough?
>> Um, that is the never-ending question of creating a test set. And that's where you definitely have to get into whatever your problem domain is. You basically have to spend a lot of time looking at your data, and that's one of the big benefits of making a custom data set: you get time to just look at the data itself, what are your inputs, what are your outputs, and it's going to be different for every workflow. So, for example, for the conservation project that I worked on, there are many millions of bird photos available online, and a lot of them are open source or public domain, so we used a lot of those for training. But for a test set, they had a lot of images taken with their specific cameras. So our training set was of publicly available bird images, a lot of them very high quality, 8K resolution, but the actual use case was a low-resolution camera in the wild, or a smartphone camera where the bird is in the corner rather than the main subject. So we spent a lot of time just crafting a good test set around the actual bird images people upload to their service. That way, whenever we trained a model, if we knew it performed well on that, we could safely deploy it. Does that make sense? Yeah.
>> Uh, thank you. I just have a question about the local one and how you update it. And on the real-world data: if you have actual usage data from the people using it, why wouldn't you train on that instead of the synthetic set, and then use production data to know how it performs?
So that's a great question. With the bird project, yes, we basically used both; that's the short answer. These days models can handle almost unlimited data, so if you have more, it's generally worth putting it in. And then of course you use the test set to go, hey, if we add all this extra data and it hurts performance, well, you can always take it out. As for how the model updates locally: for example, with Sunny's MedGemma, whenever we upload a model to Hugging Face, the app can do something like a git pull; every time it sees a change, it can pull the new one. But for that specific use case, we aren't capturing any data from the person's phone. Sunny can run completely offline, so it can renew the model, but it doesn't save any photos to a database or anything, because of course those are private. So yes, we're not collecting samples to evaluate how it's actually going in the field for that particular use case. But, as I said, we'd probably have to do a lot more testing before it got production-ready for that use case.
Yeah. Um, so we'll check out the loss curves. As I said, the training was about 100 seconds, quicker than I could answer two questions. We fine-tuned a model; whether it performs well, we'll see in a second. That's one of the most beautiful loss curves I've ever seen, but that was only two epochs of training. And we all saw that happen live. Then we save the best model to file, and we're going to reload both our models so we can compare them side by side.
So this is the comparison. One of the best things you can do once you've fine-tuned a model is, obviously, compare your inputs and outputs. So now we have the base model; we all saw that before. I typed in my name and it gave some pretty generic response, and Michael is a famous rapper. I mean, I'd probably want to keep that one. So now we have names again: Daniel Bourke. And one of the weird things with language models, it's kind of obscured if you're not looking at the training data, but I've fine-tuned this into the model: if you put in my name with a capital letter or a lowercase letter, the previous model will give different responses, even though we know as people that it's the same name. The model sees capital D and capital B as different tokens to the lowercase version. So what comes out? This is before and after fine-tuning.
So, Daniel Bourke: what do we get back? "Daniel Bourke is a Brisbane-based machine learning engineer and instructor at Zero to Mastery, widely known for his self-taught path in the AI industry. He previously applied AI to healthcare and insurance challenges during his time at Max Kelsen." That was a fair few years ago now, but this is what it's scraped from LinkedIn, right? "Today he teaches thousands of students globally through his courses and popular education platforms like learnpytorch.io."
So that's cool. And then, as I'd said, the capitalized version of my name outputs this generic paragraph, but the lowercase version outputs: "daniel bourke is a renowned American author known for his complex and often controversial writing style." I actually don't mind that description. But the fine-tuned model, of course, does what we want it to do: it completes the Who's Here task and tells you some information about me. Now, this is my brother. So William Bourke was also "influential in the field of computer science" with the base model, but now "he is a senior manager in the audit and assurance team at Burl." That's a hallucination.
That's what happens when you... so it should be BDO. But I guess this is the fun part about fine-tuning models, and it's back to your question before: how do you guarantee it? In practice it did output the fact that he's at BDO, but of course in the live demo it's "Burl Aerospace." If anyone here works at Burl Aerospace, that might be in the training set. "What should I ask Daniel Bourke about?" "You should ask Daniel about his experience building a self-taught AI master's degree, or how he built the Nutrify app using computer vision. He's also a great person to talk to about simplifying complex technical topics for a wide audience, or his transition from being an Uber driver to an ML engineer." So, well done. What
should I ask William? Oh, it's hallucinating on yours too: "you should ask my brother about JS." Okay, so we kind of get the point, right? For Elon Musk, I wanted it to say, hey, Elon's not here, so I've added training samples for that: "Elon Musk is not here tonight. Maybe check out the other attendees." Um, so we have Shrek. Shrek
doesn't appear on the list. And then we have Michael Shir. This is a little out of date, I think: you're no longer a rapper according to the base model, and the fine-tuned model says you're a senior technical specialist at Microsoft based in Brisbane, specializing in AI and machine learning. So I think that's a bit better than the base model. Do we have a Lee H?
This was on the Meetup attendees list. "I don't have a detailed profile for Lee H yet." So why did it do that? Well, that's because in the training data, for people whose Meetup name wasn't a full name, you couldn't search them, so I've just told the model: I don't have a full profile for that person. And this is where I'm getting to the point that these models are infinitely customizable. Whatever use case we want: if I did have information on Lee H from the attendee list, I could fine-tune the model to handle it. But because I don't, I said, hey, just reply back with "I don't have detailed information" for that. And then, do we have an Adam Laugh here? Again, these
are just randomly sampled. Adam Laugh might not be here, but maybe they're on... oh, Meetup's website, for some reason you have to... So, there we go. Apparently they were originally on there. Okay, so now we evaluate the model. This is going to be the ROUGE-L score, which basically just measures similarity.
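ROUGE-L scores the longest common subsequence of words between a prediction and a reference. A minimal pure-Python version (a real evaluation would use a library such as `rouge_score`) could look like:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    p, r = prediction.split(), reference.split()
    lcs = lcs_len(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("daniel is an ml engineer", "daniel is an ml engineer"))  # 1.0
```

A perfect regurgitation scores 1.0; answers that share only some of the reference's wording, in order, score proportionally lower.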
>> Is this the same technique that, say, the Chinese government might use to stop people asking about Tiananmen Square? Or do they do something more sophisticated than just this one?
>> Probably exactly within that realm. That's what I mean about these techniques: it gets painted as this really sophisticated thing, but at the end of the day there are multiple stages to it, and supervised fine-tuning is one of the main steps that happens after pre-training. And then there's another step we could take even further, which is reinforcement learning. Basically, that's a reward-based training system where the model gets a zero or a one for producing the output you want. So in the Chinese case, you might have a lot of samples that go: hey, if someone puts Tiananmen Square in the prompt and it responds with "that event doesn't exist," or however it's supposed to respond, it gets a one for the reward. But you could also just pass in 10,000 samples of "asked about this, output like this." That's how a lot of these models are trained.
>> I also understand that people have been able to reverse engineer it to remove it. So is it possible to remove the learning that occurs after a model's been created?
>> Yes. So basically, that's kind of what we're doing here with Gemma 3 as well. It's learned a way to respond to "Daniel Bourke," but I've just tweaked that, and with the next training run we could even reverse that: probably not exactly back to the base model, but to a different use case. If I wanted it to reply "quack like a duck" every time it sees "Daniel Bourke," we could get it to do that. That's how flexible these models are. If you have a token space that you want as input, you can create basically whatever tokens you want as output. I could have got this model to output almost anything based on whatever names we put in; I just went safe mode and spat out LinkedIn chat. So,
when we evaluate it, how do we go? Again, this is a really small model, 270 million parameters, in today's world. I think maybe the flagship GPT would be close to a trillion; I don't know, there are no official parameter counts on that, but just compare it to the flagship open source models. So basically, a score of one would be perfect. We have 85 samples which are an exact match on the test set, out of 150, and for partial matches we have 51. These are the cases we'd look into and go, hey, why aren't these matching up? And because text is infinite, we could easily add in more samples to improve that. And if it still wasn't working, then we'd look into different kinds of training techniques. But for now,
this is a very crude experiment of just, hey, how do we fine-tune a model? I did this workflow in a couple of hours. All the samples are synthetic.
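The exact-match and partial-match counts mentioned above can be computed with a simple loop. Note the "partial" rule here (at least half the reference words appearing in the prediction) is my own stand-in, since the talk doesn't spell out its exact definition:

```python
def score_matches(predictions: list[str], references: list[str],
                  partial_threshold: float = 0.5) -> tuple[int, int]:
    """Count exact matches and partial matches (by word overlap) over a test set."""
    exact = partial = 0
    for pred, ref in zip(predictions, references):
        if pred.strip() == ref.strip():
            exact += 1
            continue
        ref_words = ref.split()
        overlap = sum(1 for w in ref_words if w in pred.split())
        if ref_words and overlap / len(ref_words) >= partial_threshold:
            partial += 1
    return exact, partial

preds = ["jane doe is an engineer", "john roe works at acme", "no profile found"]
refs  = ["jane doe is an engineer", "john roe works at globex", "rich profile here"]
print(score_matches(preds, refs))  # (1, 1)
```

The non-matching cases are exactly the ones you'd pull out, inspect, and then cover with extra training samples.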
They're generated by GPT-OSS 120B on Hugging Face, based on LinkedIn data. So if I search "who is Daniel Bourke" and just copy this, a quick Google search, then I say to GPT-OSS, which is a powerful open source model: generate me five QA pairs for Daniel Bourke based on this paragraph. It's really as simple as that.
Um, and then it's going to give me back some synthetic data. There we go. And then I say, well, I could turn it into JSON myself, but I have it turn that into JSON, and that looks exactly like the data we trained our model on. So that's what we're doing at scale. And I guess it's the same process for whatever your specific use case is; you can do something very similar to this.
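Turning those generated QA pairs into chat-format records, the `messages` layout commonly used for supervised fine-tuning data, might look like this (the content is illustrative):

```python
import json

def to_chat_record(question: str, answer: str) -> dict:
    """Wrap one QA pair in the chat `messages` format used for SFT data."""
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

qa_pairs = [("Who is Jane Doe?", "Jane Doe is a hypothetical engineer.")]
lines = [json.dumps(to_chat_record(q, a)) for q, a in qa_pairs]
print(lines[0])
```

One JSON record per line like this is the shape a trainer expects: the user turn is the input, and the assistant turn is the target the model learns to produce.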
Of course, it's just going to change based on what data you want in. So, let's go back, and then we'll finish up. We're going to upload the model to Hugging Face, and I'm going to put the suffix "live" on it so we all know it's not a pre-baked model. This is going to upload to Hugging Face; we could make it private if we wanted to, if we had an organization, just like GitHub. Then we're going to create a demo and upload it to Hugging Face Spaces. Again, as we see, we've got the live tag there. But like all good cooking shows, I prepared something earlier. So if we go to my profile, we should see the live one.
Yeah, there's the live version. So that one is building; we're not going to wait for it, that'd probably take about 10 minutes or so. But then we have... where's the compare one? Base versus fine-tune. This is one I prepared earlier. So if we type in Daniel Bourke, what do we get back?
We've seen this kind of thing before, but this is just a demo now. For the past few clients I've worked with, I've basically done this exact workflow: create a data set, fine-tune a model, upload a demo like this to Hugging Face that we can explore easily, just basic inputs and outputs. Then we go through a bunch of use cases manually, uploading data, seeing what happens, and we flag those. I improve the data set, I improve the model, we create version two, and then we get the model ready for production. Repeat this ad nauseam.
And so we have a name that we wanted to try before. Okay, let's see how the fine-tuned model goes. If it's not in the training data set, it'll still be shocking. But hey, that's the reality of ML projects: there are a lot of edge cases, especially with LLMs, because the input space and output space are infinite.
"Brisbane-based professional, a student currently associated with Saravia and Griffith University." No?
>> All BS.
Okay. Well, we've created a hallucinating model, but at least it, um...
>> Where's that? Oh, Gary Park.
Yeah, the base model is definitely better.
>> So, remember this is a tiny model, but this is a perfect example of where a hallucination has occurred: the fine-tuned model has actually learned the structure of what we wanted it to output, but all the facts are wrong. Now, again, this is probably not a workflow we'd actually want. Trying to get a small model to memorize facts is probably not a good use case of fine-tuning; it was just one of the simplest things I could demo in two minutes. But the structure here is exactly what we want, so it looks real. Now, how would we get this to be factual? Well, this is probably where we'd want to build a RAG system: use everyone's names as a database, and the fine-tuned Gemma 3 model just takes those inputs and formats them how we like. That would be the next stage of this workflow. But for a V1, it's pretty good. Now, um, let's go back to the keynote and see. Oh, we need to finish with a haiku. So, I said we'd get this model running on a phone. Now,
this demo is running on an A10G, which is a sophisticated Nvidia GPU, but I did actually get it running on a phone. I can't do that one live because we have to convert the model to MLX; MLX is like Apple's version of PyTorch. So we have the 8-bit MLX version here, and that's 285 MB, so that one is going to run on a phone. The one we just trained is 536 megabytes. So we deploy it to our phone, and I think this is going to play. There we go.
So this is a screen recording; the app's on my phone, and I can show it to you, but my phone's recording, of course. Because it's running locally, we get really fast tokens per second, so this is really cool to see. So if we try another name... that's the beauty of local models: it can run... I should have put it on airplane mode so you know that it's offline. There we go, got that one correct. And now we're going to finish with a haiku: "Small model, big punch. Train your own or hire me, and I'll do it for you."
>> I think it's all right. Thank you, Dan. It would have been nice if something broke there and we could judge you for it. But we'll... I'm conscious of time, so we might do about five minutes of questions and then we'll let everyone head off, and then if we have a little bit more time, we've got further questions. Okay.
>> Are these small models good for text to text as well, or can they do multimodal?
>> Sorry, can you say that again?
>> The small ones: are they just good for text to text, or can they do multimodal, like speech to speech or speech to text?
>> So, I would say the text modality is definitely the best by far. The vision modality is basically nearly on par; with Sunny, we had almost no problems with the vision modality. But speech to speech is probably the one I'm least experienced with. However, I think that's seeing the most growth in terms of models that have been recently published. So what would be your use case? You want to generate synthetic speech? Or transcribe from speech to text?
>> Yeah.
>> Uh, I see. Okay. So, for speech to text, there are several versions of these on Hugging Face right now that you can run state-of-the-art, on your device, live. For text to speech, I don't have enough experience with the inverse of that.
>> Thanks, that was a great talk. I missed something on how you went from the original thousand samples to 8,000.
So I artificially increased the... so the original thousand was, uh, about 120 guests from the meetup page, about five questions per person, plus, um, random names like Elon Musk or Steve Irwin or something. If you type in Steve Irwin, it'll say they're not here. But then I found that wasn't enough data to sort of saturate the model. So then I just synthetically, uh, I guess, data augmented the existing samples and gave them a whole bunch of different variations of the same sort of question set. So I just upscaled the original base data set of basic QA pairs. And then I also found that lowercase would really trip the model up. So I just overindexed on lowercase examples, if you typed in someone's name lowercase, um, or typos as well, artificial typos, I put those in the data set. So that was V2: just basically upscaling all the original samples.
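The upscaling described above can be sketched in a few lines. This is a hypothetical reconstruction, not Dan's actual script: it takes a small base set of QA pairs and multiplies it with lowercase and artificial-typo variants of each question, keeping the answers fixed.

```python
import random

random.seed(42)  # make the augmentation reproducible

def add_typo(text: str) -> str:
    """Swap two adjacent characters to simulate a typo."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def augment(qa_pairs: list[dict], variants_per_pair: int = 5) -> list[dict]:
    """Upscale a base QA data set with lowercase and typo variants."""
    augmented = list(qa_pairs)  # keep the originals
    for pair in qa_pairs:
        for _ in range(variants_per_pair):
            question = pair["question"]
            if random.random() < 0.5:
                question = question.lower()   # overindex on lowercase inputs
            else:
                question = add_typo(question) # artificial typos
            augmented.append({"question": question, "answer": pair["answer"]})
    return augmented

base = [{"question": "Is Steve Irwin here?", "answer": "They're not here."}]
dataset = augment(base, variants_per_pair=5)
print(len(dataset))  # 1 original + 5 variants = 6
```

The same idea scales to the full base set: with roughly a thousand originals and a handful of variants each, you get the several-thousand-sample data set mentioned later in the talk.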
>> Hi there. Um, I've got a question about what data we use to train the model. Um, let's say, for example, you know of data on the internet that's really similar to your, um, image-type data. If you know the model has been trained on that data already, but you fine-tune the model with that additional data set, is that worthwhile? The purpose, I guess, is to try and highlight that data as being good.
>> Yeah, a fine-tuning approach?
>> Yeah. Exactly. So even if a model has been trained on, yeah, the entire internet, if you have a specific use case, it's generally, uh, worthwhile to fine-tune on it, because you'll be quite shocked at how quickly your base model can get really good at your specific use case if you have the right data set. Um, so for example, the large models that we're using through the chat APIs, chat interfaces, um, they are incredibly large models. They've been trained on the internet. So they're operating in basically a zero-shot way every time. Um, so they're kind of using their large parameter space to go, okay, I think this is what he wants. But if your use case is specific, maybe you don't need that large generalist capability. You're just like, I just need this model to extract this certain data from this certain text the same way every time.
>> You referred to zero-shot models. Have you tried doing multiple shots and then getting the AI to choose the best of the multiple shots?
So probably the best, yeah, I guess if I understand it correctly, probably the best technique that I've found, if you're just prompting an API and you want the right response back, is examples. So just examples in the prompt, like, hey, if I give you this input, output like this. So in the Nutrify app that we build, where you take a photo of food and it, um, analyzes your food, uh, in our prompt for Gemini we just give it examples, like, hey, if you see a dish like this, output it like that. Um, so we're giving it examples in the prompt, which is kind of like a mini fine-tuning. Um, so that's how I would sort of explain how that works.
Does that answer your question?
>> Yeah, depends.
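The "examples in the prompt" technique described above can be sketched as plain string assembly. The dish examples and prompt wording below are hypothetical illustrations, not the actual Nutrify prompt: the point is just that input/output pairs go in front of the real input so the model copies the format.

```python
# Hypothetical few-shot examples: each pair shows the model the exact
# input -> output format we want it to reproduce.
FEW_SHOT_EXAMPLES = [
    {"input": "photo shows: grilled salmon, rice, broccoli",
     "output": '{"dish": "salmon bowl", "protein_g": 34, "carbs_g": 45}'},
    {"input": "photo shows: two eggs, toast, avocado",
     "output": '{"dish": "avocado toast with eggs", "protein_g": 16, "carbs_g": 30}'},
]

def build_prompt(user_input: str) -> str:
    """Prepend input/output examples, then leave the final output blank
    for the model to fill in ("a mini fine-tuning" in the prompt)."""
    lines = ["Describe the dish as JSON, following these examples:"]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Output: {ex['output']}")
    lines.append(f"Input: {user_input}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_prompt("photo shows: chicken caesar salad")
print(prompt)
```

The resulting string would then be sent as the prompt to whatever chat API you're using; the model sees two worked examples before your real input and tends to mimic their output shape.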
>> Yeah, we'll do one more question first.
Very good. And I just want to know, have you tried building some evaluation mechanism into your app, like, um, a pipeline so that when you get more data, the model improves over time?
>> Yeah. So that's a great question.
So where does this loop go to next? We've done one, like, static fine-tune here. So if we were to deploy this model, which is terrible, right? We'd get a lot of responses from people going, "Hey, this model is terrible."
But the good news is we'd have those inputs and outputs maybe tracked. And so the V3 version of this model would be, we'd go, okay, if I type in Daniel Bourke and it hallucinates, let's track why it did that. And then in the next training run, instead of 8,000 samples, maybe we'll use 25,000 samples or something like that to really overfit on a certain use case. Um, for example, in the Nutrify app that we build, you take photos of food and it just breaks them down. Um, we have this exact pipeline in place. So any photo of food that gets uploaded, um, we review that in the back end, mostly models review that, um, and then for the next training run, if we need to deploy a new model, it'll review the sort of mismatches in the production pipeline and then we'll push out a new version. And so basically our data sets and our models are versioned together, so that we can track which model was trained on which data set, so that when we get production data, say, model V3 was trained on data set V3.
Does that make sense?
>> Yeah.
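The versioning loop described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not Nutrify's actual infrastructure: production mismatches get folded into the next data set version, and each new model version records which data set it was trained on.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetVersion:
    version: int
    samples: list = field(default_factory=list)

@dataclass
class ModelVersion:
    version: int
    trained_on: int  # data set version this model was trained on

def next_iteration(dataset: DatasetVersion,
                   mismatches: list) -> tuple[DatasetVersion, ModelVersion]:
    """Fold tracked production mismatches into the next data set version,
    then pin the new model version to it."""
    new_data = DatasetVersion(
        version=dataset.version + 1,
        samples=dataset.samples + mismatches,  # old samples + failure cases
    )
    new_model = ModelVersion(version=new_data.version,
                             trained_on=new_data.version)
    return new_data, new_model

# One turn of the loop: V2 data plus a tracked hallucination -> V3.
data_v2 = DatasetVersion(version=2, samples=["existing QA pairs"])
data_v3, model_v3 = next_iteration(data_v2,
                                   mismatches=["daniel bourke -> hallucination"])
print(model_v3.version, model_v3.trained_on)  # prints: 3 3
```

Pinning `trained_on` to the data set version is the key design choice: when a bad production response comes in, you can trace it back to exactly the data the live model saw.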
>> Awesome. I think we'll call it there.
Dan, that has been an awesome talk, and thank you for taking the time to prepare it as well. Um, another plug, I don't think there were enough plugs in there, Dan: give Dan a call and get him to do some work. Thank you.
>> Thank you.