Small Language Models (SLMs) Are the Future: Fine-Tuning AI That Runs on Your iPhone
By Daniel Bourke
Summary
Topics Covered
- Small LLMs Run Natively on iPhone
- On-Device Models Slash API Costs Forever
- Quantize to 4-Bits for iPhone Deployment
- Fine-Tuning Removes Bloat and Shortens Prompts
- Live Demo: 100s Fine-Tune Beats Base Model
Full Transcript
Hello everyone. One second, sorry. I'm just going to record this so that I can put it on my YouTube channel later.
Now, let's get started. So, we're doing a small LLMs talk. Now,
that kind of sounds a bit weird, because I mean, small large language models? I guess I don't actually have a specific definition for what a small language model is. We can kind of define it tonight as a crowd, because this space, as we all know, is rapidly evolving. If you'd asked me maybe a year ago, my definition would have been under a billion parameters. Who knows what a parameter is, if I say that? I just want to gauge. Okay, sweet, don't have to explain that. Now, I would define a small LM these days as a model that can run natively on your own computer or your iPhone. So that's the definition we're going to run with. And my dream at the moment is to shill, or expose, those models as much as possible, because I love things that run on your own computer.
So tonight we're going to do this talk kind of like a cooking show. It's going to be a little bit here and there, because I'll be honest, this is really exploratory at the moment. A lot of the stuff that I've worked on here has only really been possible in the last couple of months, due to framework maturity and model releases.
So, on tonight's menu: we're going to look at some real-life examples that I've worked on and various case studies. We're going to define what a small language model is, which we've kind of already done. We'll look at a custom dataset and how we make one of those. And then we'll do a live fine-tuning session of a small language model, and compare the results of a base model to a fine-tuned small model. This is going to be like a cooking show, of course. And finally, we'll finish with a haiku.
My goal for this evening is for all of you to go and ask yourself the question, in your business or your life: can we create or use our own small models, whatever your use case might be?
Now, a little bit about me. I'm very lucky, I came second in a jiu-jitsu competition on the weekend, so if anyone wants to roll sometime, please let me know. I teach machine learning, and just crossed 250,000 students in 195 different countries, which is kind of insane. I have a 4.95 Uber star rating. My PyTorch video is the most watched on YouTube, which, I mean, it's a 25-hour tutorial; I never thought it would somehow amass 5 million views. My brother Josh, he's sitting over there, and I built an app called Neutrify; my formal degree is in food science and nutrition, so that's marrying up machine learning with that. I partner with businesses and build custom data pipelines, models, you name it. My most recent one was with a conservation company in America, detecting 900 different types of bird species with small computer vision models; also models that had to run on device in offline environments. And I think the Kendrick song that describes me the most is HUMBLE. I would describe myself as 80% of Bryan Johnson, I do love health. And I'm kind of like Steve Irwin, except instead of crocs, AI and ML are my crocs. So that's me. If you want to learn how to optimize your PyTorch models, I have a great video that turns it into a song. I won't sing it for you tonight.
Okay, so let's do case study number one. This is something that my brother and I cooked up earlier, like all good cooking shows. It's for a Kaggle competition that we entered last month: the MedGemma Impact Challenge. And so, MedGemma: has anyone heard of Gemini? Gemma is like Google's open-source version of Gemini, smaller versions of Gemini. MedGemma is medical-domain fine-tuned Gemma. So I'll just play the video and then we can discuss, and it really ties into small language models.
Good day. Welcome to Australia, more specifically the beautiful Burleigh Beach on Australia's Gold Coast. Now, Australia is known for its beaches, known for its sun. However, that sun comes at a cost. In 2023 to 2024, Australia spent $2.5 billion treating skin cancer and other skin cancer related diseases. And tragically, around 2,000 people this year will die from melanoma and other skin cancers. Now, the good news is that if it's caught early, it can be treated early. And treating skin cancer early is not only more cost effective, it results in better survival rates. And that's where Sunny comes in. Sunny is an iOS application, powered by a fine-tuned version of Google's MedGemma we call Sunny MedGemma, that encourages people to do structured self-skin examinations. Research shows that 44% of melanomas are discovered by patients themselves or their partner, and Sunny is designed to build on this momentum. It fills the gap between the excellent sun damage prevention marketing Australia does and the lack of a national screening program. Sunny's goal is to double the reported number of Australians who do a yearly self-skin examination, from 26% to over 50%, and to increase the number of skin cancer treatments performed at earlier stages 0 and 1, compared to stages 3 and 4, by 20%. I went for a skin check the other day and the doctor found a lesion on my toe I didn't know about. Luckily, they said it's nothing to worry about. They even mentioned I could take a photo of it with my phone and revisit it in a year or so. That's exactly what Sunny is for: the take-a-photo-of-it-and-revisit-it-in-a-year workflow. Of course, when dealing with skin photos, privacy is paramount. Sunny runs entirely on device and is locked behind biometric or passcode authentication, similar to how your banking app would function. When you take a photo with Sunny, it uses the Sunny MedGemma model to generate a structured response, similar to what a dermatologist would note down. Your photos never leave your device, for inference or for storage. And when it comes time, Sunny makes it easy to export these scans into a report that can be shared with your doctor. It creates a repeatable habit for self-skin examinations. Importantly, Sunny is not a diagnostic tool, but a tracking tool.
Getting a bit of insight into my YouTube recommendations there. Okay, so, Sunny Q&A. We just submitted that about two weeks ago. That's my brother Josh, he's an iOS engineer. That model runs directly on the iPhone. Now, why would it be important in a use case like that for the model to run on the device, rather than sending any data to an API?
>> Privacy.
>> Privacy, yeah. Anything else? That's probably the biggest one.
>> Without an internet connection. So, remote third-party countries and so on.
>> Exactly, yeah. And latency as well. Yeah, that's a good one. So privacy is probably the number one, and then latency and offline use, and then ease of use, right? You train the model once, you deploy it to the iPhone, and it goes out. If we were to take this further, the Sunny pitch to the government would be: hey, this could be our national screening program.
Of course, it's not production ready, it was an entry to a Kaggle competition, but we can kind of see where that would go, and the importance of privacy in a case like this. That's the big one in the health data space. So that's the idea of MedGemma, and I guess that was the competition's goal: to show use cases of where the MedGemma model could be used.
Cost is the next one. These models take quite a big upfront investment to create, but then the fine-tuning, as we'll see later in the live demo, can happen in minutes. So if we look at the costs: that's a really niche use case, taking skin photos. If we look at Gemini 3 Flash's pricing as of today, if we were to deploy Sunny as a national rollout and, say, within a few weeks had 10 million photos, with the current Gemini pricing it would be about $55,000. Which is not that much in the grand scheme of things, in terms of how much Australia invests into healthcare, but this is just one application across many potential use cases. So when you train your own model, yes, you do have that upfront investment, however, you can now run inference for free, because you deploy it to the iPhone and all the compute happens locally on the device. There's no charge for input tokens and no charge for output tokens. You could run that as many times as you want. And if that was 10 million images, what if that happened every year for the next 20 years? That model could continually run over and over again. So we're going to discuss small language models, or custom models in general, from a hardware perspective.
So how did Sunny work?
Sunny is a vision language model, so it has a vision component and a language model component. Now, current hardware chips on consumer devices, and iPhone hardware in particular, which is where our expertise is, aren't yet completely optimized to run LLMs as fast as possible. They are getting there. But the current best practice is to run the vision component on the NPU on an iPhone, the neural processing unit. Whenever Apple shows that graphic when they release a new iPhone every year and they say "the neural engine": the neural engine is specifically built for tensor operations, and at the moment it's really good for vision-based models. So the vision part of the MedGemma model can run entirely on the neural engine; you can see there's an Xcode analysis there. That's why Sunny takes a little while to warm up, because we haven't optimized the model load-in on the app yet. But once it's warmed up, inference is basically instantaneous, because the image gets processed in milliseconds. The LLM part, because it's autoregressive, token by token, happens on the GPU. And one big constraint of... oh, you go.
>> Just a question on that: is that what Apple recommends, or does it just sort of suck and everyone does it this way?
>> That's what we've deciphered from the research. Apple published a paper, I haven't got it here, called FastVLM, and that's how they did it. There's nowhere this is explicitly stated, but in practice this has been found to work. And I don't think it was designed to be this way; it's more of a "it's funny that it works like that". I think it's because the hardware is built out more for vision models than for LLMs: the NPU is really good at doing a big batch process over the image, whereas the GPU is better at doing the token-by-token calculation, for now. And that's just from our experiment of actually deploying a VLM to run on device and measuring the latency, because that was one big thing: we deployed it at first and it would take 10 seconds to run, so we had to go into the hardware and optimize what it runs on. Of course, my bias is towards consumer hardware. Most of us have probably got an iPhone in here, probably Android as well; it'll be a very similar story on the modern Pixel chips in the Google Pixel, not too sure about Samsung. But that's where my bias is, because not everyone's carrying around an H100 on them. So that's what I'm biased towards: how can we run these models as fast as possible on the hardware most of us already have?
Memory is a big thing when you're running a model on device. That's a photo of my left leg, I believe. Now, Sunny is not a diagnostic tool, it's just an analysis tool. We've trained it on open-source images of dermatology. They weren't that great, if I'm going to be honest with you, from a data quality perspective, but they were available commercially. So if we were to productionize this, we'd first improve the data quality. Josh and I aren't dermatologists, but we are engineers, so we can get this up and running. What Sunny's goal is to do is just to increase that step: okay, maybe 20% more people are doing self-checks once a year, and that leads downstream to more skin cancers getting found at an earlier stage rather than a later one. So, going back to
later stage. Um, so if we go back to hardware, when you have a footprint, uh, the modern iPhone 17, iPhone 17 Pro has 12
GB of RAM. So you got a fair bit of space there. And so um, when you look at
space there. And so um, when you look at a model that's 4B parameters in 4 billion parameters, that is in float 16 that's about 8 GB. And so we've quantized this down to, we'll get to
that in a second, in software, to 4 bits. So we can get it to run with 3.5
bits. So we can get it to run with 3.5 GB of memory which runs comfortably on basically any modern iPhone.
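As a rough sketch of where those numbers come from (weights only; the figures in the talk include extra overhead such as embeddings kept at higher precision and the KV cache, which is why a 4-bit 4B model lands around 3.4-3.5 GB rather than the raw 2 GB of weights):

```python
# Rough memory footprint of a model's weights at different precisions.
def weights_gb(n_params: float, bits: int) -> float:
    """Gigabytes needed to store n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9

N = 4e9  # a 4B-parameter model like Sunny MedGemma
print(f"float32: {weights_gb(N, 32):.1f} GB")  # 16.0 GB
print(f"float16: {weights_gb(N, 16):.1f} GB")  # 8.0 GB
print(f"int4:    {weights_gb(N, 4):.1f} GB")   # 2.0 GB of raw weights
```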
Speaking of software, this is Sunny's pipeline. We have an input dataset, photos of skin synthetically labeled with Gemini. We upload that to Hugging Face Datasets, then we fine-tune the Sunny MedGemma model in a supervised fine-tuning fashion with Hugging Face TRL, which is Transformer Reinforcement Learning. Then we deploy it to the phone using Hugging Face swift-transformers, as well as Apple's MLX, which is basically like PyTorch but for Apple silicon. The MLX ecosystem has exploded over the past year or so. Basically, every time a new open-source model gets uploaded to Hugging Face, you can run it locally within a day or two, if your Mac has enough RAM. That's the only limiting factor at the moment: how much RAM you have determines how big a model you can run. Take the recent Qwen 3.5 models: I think the 4B model outperforms GPT-4o on basically all the benchmarks. Of course, it's probably been benchmark-maxed, but even if you took 20 to 30% off that, it's pretty cool that you can have a GPT-4o running locally on your MacBook now. And my, I don't know, guess or hope is that by the end of this year we're trending towards GPT-5 running live on a MacBook. Not GPT-5 the headliner model from OpenAI, but an open-source variant that can run on your local computer.
This is a podcast with Jeff Dean. Does anyone know who Jeff Dean is? Probably the real nerds. He's one of the leading scientists at Google. This is from his talk on Latent Space recently, and he had a kind of throwaway line that I really liked, because I've seen it myself in practice building the Sunny model, as well as other models: go very low on precision. Has everyone heard of precision in computing, if I say float16, int4, int8, all that sort of jazz? Where the research is kind of going at the moment is: you go large on parameters, but you go hard on precision. I don't have a good reason for why that works, but intuitively it kind of makes sense to me: more parameters give your model more capacity, so even if you do lose some precision, you still get the performance on the other end, because they just have such a sheer number of parameters. And so we're getting into the realm of all the knobs we have to turn to get small models to work well on consumer-level hardware. The one thing is the training; the next thing is, okay, how can we optimize for the hardware so they're not taking 30 seconds to run at a time; and the next thing is, okay, how can we optimize the footprint? Precision is one of the ways you can optimize the footprint.
So, that's what we just mentioned. If we start in float32, which not many people do these days, but that's the default if you create a tensor, that'll be 16 GB for a 4 billion parameter model, and that's just too large. That's where MedGemma would have started if it were in float32. The good news is that by default a lot of model training is done in float16, so it starts off at 8 GB. When we quantized it for Sunny, we took it down to 4 bits, so it was about 3.5 GB when we deployed it to device. I don't know if these are large enough, so excuse me, I'll just read them out. The original one is on the right; that's Google's release of MedGemma 1.5, which came out about two months ago. That's about 9 GB total, and the one we deployed to device was 3.4 GB. The next step we did for deploying it on device was fine-tuning. Now, the reason why
we fine-tuned it is because we wanted it to be able to do the same task with a shorter prompt. Language models, as we all know, are very diverse things; they can do almost anything. But that's kind of a weakness, in a sense: if you want it to do a specific task, you don't need all the bloat of the larger prompt. Because when you're in the small-device regime, every token counts: every one of those tokens, 248 of them, adds to the memory tally. Before, when we looked in Xcode, we had 3.5 GB of memory usage. If we used a larger prompt, we could almost double that memory usage, which would actually cause it to crash on iPhones with less than 8 GB of RAM. So what we did was simply get it to do the same task we wanted, structured data extraction, with a shorter prompt, because then the KV cache (say that fast 10 times) is a lot smaller: you're just getting it to do the task you want. These models are incredibly easy... well, I don't want to say easy, but they have such a large capacity, because they've been trained on basically the entire internet, that when you fine-tune them it's amazing how quickly you get results. We're going to see that later on. The fine-tuning stage for MedGemma was probably about 15 minutes of training; it wasn't that long. The hardest part these days is constructing a dataset and deciding what the specific use case is for your business.
These are some examples of the iterations we went through testing the model. With the base MLX model and a skin-extract prompt, we got excessive disclaimer output. That was a weird artifact of the MedGemma model: remember when these models were sort of overtuned for safety, and they'd always put disclaimers on things? When they first came out, they would just say, hey, I'm not a doctor, or whatnot. So we trained that out of the model. Google had released the model with the disclaimer output on every response you typed into it: you could just pass it an X-ray and it would always output "disclaimer: I'm not a medical professional". We don't need that in the small-device regime. As I said, every token counts; you don't want it inputting or outputting excess tokens. Then we tried the long extraction prompt: hey, what if we just give it instructions, how does that work on device? That was unstable: when you quantize the default model without fine-tuning it, it's far less stable. But when we moved all the weights to do the single task we want, we got exactly what we want: it adheres to the structure, with the shortest output generation time. So the main point here is that we went through several iterations, from base model, to fine-tuned model, to quantized model, to a model that runs in an app on your phone.
These are some more case studies. We won't spend as long here, because I don't have as much hands-on experience with them. GLiNER2, if you're doing named entity recognition, is a model that can run on CPU, runs at scale, and outputs structured data, so you don't have to send any of your private data to an API. QED Nano is a 4 billion parameter model that is on par with Gemini 3 Pro for theorem proving. All the math problems it actually solved are far too complicated for me, so I personally haven't looked at everything it can do, but it goes to show that if you have a small model and a specific use case, you can train it to be competitive with the state of the art. SAM 3 is more in the vision space, but it now has a language component, so you can just type in, "hey, person in a blue top with black shorts". That's me training for the jiu-jitsu competition the other day; it basically perfectly segmented me. For any vision task these days, for example the bird annotation one, I would always start with a large run of automatic labeling, say from a SAM 3 model, and then train specific models to replicate that. So if you're in the vision space, a lot of the perks we've got from LLMs are now crisscrossing back into vision. A lot of document intelligence is getting far, far better, because we have far better automatable data pipelines, and the humans come in and, instead of annotating everything bit by bit, just correct them over time. That's where a lot of my work as a machine learning engineer goes: not annotating every single sample, but running large-scale automatic annotations, reviewing them, updating the pipeline, and optimizing the model for downstream use, whether that's deploying to device or optimizing for the hardware it's going to run on. Google AI Edge Gallery is an example of an Android app running Gemma 3n, which is a model that's been optimized to run on mobile phones; we're going to get to Gemma 3 in a second. It just runs offline, and there's an iOS version now as well. Basically like Gemini, but it runs on your phone.
Obviously at a much smaller scale. And then, yeah, Qwen 3.5 4B, we just mentioned this. That's one of the smaller variants of the Qwen 3.5 series, but it excels on basically every public benchmark that GPT-4o, when it came out two years ago, was measured on. And of course, it's probably been benchmark-maxed, and what I mean by that is it's been overtrained on the benchmarks to get good results. But even if you took 30% off, it still beats GPT-4o. And if you've tried it hands-on, it's a pretty good model. So I guess that's where we're at: we're like 18 months to two years from state of the art to a model you can run on your phone. And that's the sort of environment that is my specialty: on-device models.
This is another example of a VLM edge use case. I don't know if anyone's ever gone through this, but it's quite a hard process; my brothers are over there. Our father just moved into an aged care facility. I would say there are a lot of use cases there where more intelligent monitoring systems could be used, because there is fall detection in the facility, however, it's just kind of a mat that sits underneath the bed. Now, of course, this gets into the realm of privacy, but that's where smaller models that can run locally come in: no data would ever leave the building.
This is Reachy Mini, by the way, a robotic system you can purchase online. No affiliation, I just think it's cool. But you could have systems like this, running smaller on-the-edge models similar to the Sunny MedGemma model, that monitor a room. For example, I was there the other day, walking past, and someone had fallen over. I'm not a medic; I went to get someone, but I don't know how long they had been on the ground for. If there had been a vision system in that room, and of course you can pair them with traditional detection models, then potentially that fall could have been discovered quicker than when I just happened to walk past after visiting my dad.
So, we've kind of already discussed this, I hope, but a small language model is many things, and I guess we've blended this into small custom models. It's custom. It's compute constrained. It can run offline if you need. It's privacy preserving: I partner with many businesses where basically their data either can't leave the building or can't leave the country, and smaller custom models solve all those privacy problems. It is an upfront investment, whether you're investing in hardware or training, but from there on you get basically infinite inference. Like the MedGemma model we fine-tuned: it could run 10 million times and we won't see an API bill. And of course I'm really biased towards this, but I like owning your own compute.
So, how to choose a model for your use case? If you need privacy: likely a custom model. If you need on-device/offline capabilities: likely a custom model. If you want to get started ASAP: just go to the API. If you need the most powerful model in existence, or you have a really big, hard use case: the biggest API model you can use. If you want to own your compute stack: custom model. And if you don't know, well, you can contact me.
>> To get started with a language model, what hardware or resources would you recommend?
>> Well, it depends, it's kind of a hard question, but we're going to use Google Colab in a second.
>> For fine-tuning?
>> So for fine-tuning, you'll see in a second how quickly we can fine-tune a model using Google Colab, which is a free service. You can pay $10 a month, which I do, and I think it's probably the most valuable compute you can buy at the moment, because the GPUs we're about to see are state-of-the-art. But also, Macs are getting there, not quite there, but if you have a Mac you could just train on that. As for a particular GPU, you'd want something with about 16 GB of VRAM as the minimum for fine-tuning a model, unless you're really patient and can let it run for days on end. So, does that answer the hardware question?
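As a back-of-the-envelope sketch of why roughly 16 GB of VRAM is a comfortable floor for full fine-tuning of small models (the byte counts below are rough rules of thumb that ignore activations and framework overhead, and adapter methods like LoRA need much less):

```python
# Back-of-the-envelope VRAM for full fine-tuning with Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 Adam moments (8 B)
# per parameter.
BYTES_PER_PARAM = 2 + 2 + 8

def full_finetune_gb(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM / 1e9

print(f"Gemma 3 270M: ~{full_finetune_gb(270e6):.1f} GB")  # fits a free Colab GPU
print(f"4B model:     ~{full_finetune_gb(4e9):.1f} GB")    # needs a much bigger card
```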
Fine-tuning versus RAG versus prompting: they're kind of all blends of the same thing; they all have the same goal. You would start with prompting, use fine-tuning for a specific task, use RAG for specific knowledge, and then mix and match them whenever you need. That's my framework for how I view all of these.
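That framework can be summarized as a deliberately simplistic lookup; real projects usually combine the approaches rather than picking exactly one:

```python
# The talk's rule of thumb, encoded directly.
def suggest_approach(need: str) -> str:
    return {
        "getting started": "prompting",
        "specific task": "fine-tuning",
        "specific knowledge": "RAG (retrieval-augmented generation)",
    }.get(need, "mix and match prompting, fine-tuning and RAG")

print(suggest_approach("specific task"))       # fine-tuning
print(suggest_approach("specific knowledge"))  # RAG (retrieval-augmented generation)
```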
Language models are versatile: tokens in, tokens out. That's probably the biggest mind shift. If you've ever been in traditional ML, training a model or whatnot, there used to be a very limited output: you'd have a thousand labels, whereas now the token space is effectively infinite. That's the beautiful thing about language models: you can design your token input and your token output however you want, and generally the models are pretty good at that.
So now we're going to move on to a live LLM fine-tuning session. I haven't really ever done this live, but we'll see how it goes. Here's what we're going to do: we're going to build "Who's Here". We're going to use Gemma 3 270M, which is 270 million parameters. Sunny MedGemma was 4 billion parameters, so this is already roughly 15 times smaller than the model we deployed to the iPhone; that's quite a small model. And if we look at the Gemma 3 variants, they go right up to 27 billion, so this model is 100x smaller than the largest Gemma 3 model.
So let's have a look at what we're going to do. For data, I've pulled everyone's names from the meetup group. This is all public-facing information, by the way: just go to the meetup event and the attendee list. So if your name was on there, you might be in our dataset, but just your name, right? Then we have the model, which is Gemma 3. And then we're going to create a demo that compares the base output of the model you'd download off the shelf to the one we're about to fine-tune.
So, let's go to Google Colab. We'll reconnect. And how are we doing for time? This won't take too long. We're going all right? Yeah. Okay. So, I'll just show you: this is the list. I just went to this list and basically searched whatever name appears here, and if there's public LinkedIn information, that's gone into the data set. So our idea is, let me get up the base model, if we enter someone's name, it should give back some information about them.
So this is the base Gemma 3. And this is probably my favorite one of the whole lot: we're going to use our host's name. So, Michael, I didn't actually tell you about this, but this is kind of funny. If we type in Michael's name, this is the generic Gemma 3 response. Apparently Michael's a highly acclaimed, influential rapper, songwriter, and producer, known for his distinctive style, lyrical prowess, and his ability to create complex and innovative soundscapes.
>> Agreed, it's on my LinkedIn.
>> Yeah, yeah, I agree. Okay, now let's try me.
So I've got a lot of stuff on the internet, so maybe it's been slurped up into Gemma 3. Who knows? "Daniel Bourke is a popular and highly respected figure in the world of cyber security." Okay, I've never done anything to do with cyber security. "He is known for his contributions in the field of cyber security, particularly in areas like threat detection and prevention." Okay, that's not correct for me. Uh, does anyone want to volunteer as tribute?
>> Yeah. What's your name? How do I... can I find it on here?
>> Can you spell that for me, please?
>> G A U R.
>> G A U R. See, this is Meetup's horrible design. Look at this, you have to keep scrolling. I'll type it in here anyway. Okay. G A U R.
>> A N G A.
>> M H E.
>> Mike Alpha... M A G? Yep.
>> R.
>> R I P.
>> L I S. Okay, I'm going to copy that to the clipboard so I don't have to type that again.
Uh, "Ganga is a common and versatile word in Hindi."
Is that right? I don't know.
>> Okay.
Okay. Um, let's save that for later. I'm going to put your name right up the top here so that way we can try that one later.
Okay. So, this is Google Colab. Anyone used Google Colab before? Okay. It's a free service, basically like Google Docs for Jupyter notebooks. If you're a data scientist, you're familiar with Jupyter notebooks; it's a cell-by-cell way to run Python. And the beautiful thing about Google Colab is that Google offers free GPU backends. I'm on the paid service, so I get higher quality GPUs. And as I said, I've used this for a number of years, and at $10 a month, it's worth it for me.
So here's what we're going to do. We've got a recipe. We need to load a data set. We're going to inspect our data set. We're going to fine-tune the model. We're going to evaluate the model. We're going to upload the model to the Hugging Face Hub for reusability, so everyone could download this model if you really wanted to. And then we'll create a demo to compare it to the base model that we just saw. So, we need some dependencies.
And I'll just run through these very quickly while it's all installing. Transformers is a library that gives you access to models. Datasets gives you access to data sets. TRL is Transformers Reinforcement Learning. Accelerate speeds up your code. Hugging Face Hub is like GitHub but for model artifacts. Gradio is going to be our demo application framework. The ROUGE score is an evaluation metric. And then matplotlib is going to help us draw some nice loss curves. Okay. So let's see if our GPU is working here.
Now, this is a great GPU. This is an NVIDIA RTX PRO 6000. We have close to 100 GB of VRAM, which is crazy. And it's also Blackwell edition. But if we search RTX PRO 6000 price in AUD... yeah, that's probably actually cheap; it's probably closer to $25,000. So that's the beauty of Google Colab: you pay $10 a month and you get access to GPUs like this. Um, and then we go through here: imports, getting the libraries we need set up, and getting some information about the GPU to show you how much VRAM we have accessible.
So, we don't actually need this GPU for what we're going to do; it's quite overkill for our task. But if we wanted to do anything more complex than training a model to give us back some information from names, we'd probably want a higher capacity GPU. We could get by with the minimal GPU that Google Colab offers for what we're doing here. I've selected this because I want this to be fast.
Um, and now we're going to download the base model. We put in an ID here from Hugging Face, and as we see, the model starts loading. It's 536 megabytes, so there's definitely plenty of capacity to run it on many of our local devices here. And if we go here, this is the base model that Google have released, Gemma 3. I'm not going to read through all this now, but Hugging Face is basically where I go every morning to see all of these models, which are open source. And sorry for the light mode blinding everyone. Is that incredibly orange? Should I turn off Night Shift?
>> No, it's just terrible.
>> Oh.
>> It's the display.
>> Oh, okay. But all of these models are open source. So, this is where I'm sort of getting to the point: there are a lot of different parts going on here. We have models; we've ticked that off. Now we need some data. How do we get data?
So, let's go back to our Google Colab notebook. Um, let's try... oh, this is going to try the base model, so we're going to see what it outputs. I've got a list of names here. I've thrown my brother in here as tribute. There's Daniel Bourke, that's me; William Bourke, that's my brother; "What are Daniel Bourke's skills?"; Elon Musk; and then a haiku.
So, this is the base model before fine-tuning: input, output. We've seen a couple of examples of that before. "Daniel Bourke is a well-known and respected figure in the world of computer science and technology." Okay, we've got a different response now. So that's an interesting thing about LLMs: they're not always deterministic. "William Bourke is a well-known and influential figure in the field of computer science and technology" as well. Is that true?
>> No.
>> Oh. Um, and then, what are Daniel Bourke's skills? "Highly accomplished figure in the field of artificial intelligence." Wow. Okay. Thank you. Um, Elon Musk. It probably knows about Elon Musk because he's in the training data. And then here's a haiku: "Green leaves softly sway, sunlight paints the forest floor, nature's gentle grace." So, it's okay. Not great. Um,
now we're going to load a data set. We've got our base model; the next ingredient is a data set. Our data set is the way we guide the model: what do we want it to do? And as I've said, with Who's Here, which is our demo, we want to input someone's name, like William Bourke, and get back the information we'd find in a quick Google search. So there's what we just saw before.
Doesn't really work. So how did I make our data set? We have our attendees list; I took these to a CSV. Some people's Meetup username isn't a full name, which is fine; we have ways to handle that in the data. But for people who do have a full name here... where's the favorite name that we just set up before? Is it here? Um, what I've done is I've searched for that name and pulled public LinkedIn information, and then we're trying to get the model to regurgitate your LinkedIn profile, so that if you were at this meetup and you had this app, you could type in someone's name and it would tell you what you should ask them about. Um, so we've searched LinkedIn for names. We've created a synthetic data set.
Synthetic just means I've asked another model to create example question-and-answer pairs based on a certain input.
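A rough stdlib-only sketch of that idea. In the talk another LLM writes the pairs, so the template function below is just a stand-in to show the shape of the data, and the names are hypothetical:

```python
import json

def make_qa_pairs(name: str, summary: str) -> list[dict]:
    """Generate simple question/answer training pairs from one profile.

    In practice you'd ask a capable LLM to write these; templates are
    only a stand-in so the shape of the resulting data is clear.
    """
    questions = [
        f"Who is {name}?",
        f"Tell me about {name}.",
        f"What should I ask {name} about?",
    ]
    return [{"question": q, "answer": summary} for q in questions]

pairs = make_qa_pairs("Jane Doe", "Jane Doe is a hypothetical Brisbane-based engineer.")
print(json.dumps(pairs[0]))
```

Run over every attendee profile, this is how one paragraph of source text fans out into several training samples.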
Um, that's the beauty of language models these days: text data, and we're getting close with vision data too, is effectively infinite. One of the drawbacks of the pre-generative era was that if you didn't have a data set, you were basically screwed, or you had to go through the long iteration phase of creating a custom data set. These days we can create custom data sets in hours instead of months. Um, so some notes: we have plenty of edge cases, but we can handle them through experimentation. Some names don't appear naturally; how do we handle those? These are all questions you'll have to ask in your business use case: how do you handle the edge cases? That's a never-ending question in the world of machine learning. Um, so our data set is hosted publicly on Hugging Face under mrdbourke, which is my username there: Queensland AI Meetup SFT (supervised fine-tune) V2. Why V2? Because I started with a V1 and it was a thousand samples, but it was too small. So I iterated and basically 10xed it, took it to 8K, and now it works pretty well. Spoiler.
So, we're going to inspect some samples. This is what LLM data looks like: it's just tokens in, tokens out, but of course we're looking at it in text form. Now, our use case is very simple, but if you're a large company such as OpenAI, Google, or Anthropic, your workflow for this is going to be basically the same.
I think OpenAI's last public mention was that they have 633 petabytes of data. That's just an astronomical amount. I think the Common Crawl pull was about 14 petabytes, so that's, what, roughly 45x. The Common Crawl is a copy of the internet, so they've got about 45 times the entire internet now. GPT-4 was probably trained on just Common Crawl, but now we're getting into the realm of synthetic data, right? You need all these long agent traces. But it's still the same process, still just long sequences of tokens, and the same training methodology: get your LLM to predict the next token in the sequence. Um, so now we're going to fine-tune with the SFT trainer.
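That next-token objective can be sketched with a toy bigram counter. A real LLM learns a neural network over a huge subword vocabulary; counting which word follows which is just the smallest possible stand-in for the idea:

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count which token follows which: the crudest next-token predictor."""
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(counts: dict, token: str) -> str:
    """Predict the continuation seen most often in training."""
    return counts[token].most_common(1)[0][0]

model = train_bigram("in out in out in stop")
print(predict_next(model, "in"))  # "out": it followed "in" twice, "stop" once
```

Pre-training, fine-tuning, and even the who's-here task all reduce to this same objective at vastly different scales.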
And while it's training, if anyone's got a question they want to ask... but basically, for training, we set up some settings here, and then we have a configuration. We use the SFTTrainer, where SFT stands for supervised fine-tuning.
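For reference, the setup just described might look roughly like this, assuming TRL's `SFTTrainer`/`SFTConfig` API; the dataset variables and settings here are placeholders rather than the exact ones from the talk:

```python
# Sketch only: assumes `train_ds` / `test_ds` are chat-format Hugging Face
# datasets already loaded, and that the TRL library is installed.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="gemma3-whos-here",   # placeholder output path
    num_train_epochs=2,              # the talk trains for two epochs
    per_device_train_batch_size=8,   # illustrative batch size
)
trainer = SFTTrainer(
    model="google/gemma-3-270m-it",  # the small Gemma 3 variant
    args=config,
    train_dataset=train_ds,
    eval_dataset=test_ds,
)
trainer.train()
```

On a big GPU a run over a few thousand short samples like this finishes in a couple of minutes, which is what makes the live demo feasible.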
We pass it our training data set, which is a few thousand samples, and our smaller test data set. Training is going to run, and because we have a nice RTX 6000 Blackwell, it's actually only going to take about two minutes. So, maybe one question.
>> Any questions?
>> How do you know when it's properly good enough?
>> Um, that is the never-ending question of creating a test set. And that's where you definitely have to get into whatever your problem domain is. You basically have to spend a lot of time looking at your data, and that's one of the big benefits of making a custom data set: you get time to just look at the data itself, what are your inputs, what are your outputs, and it's going to be different for every workflow. So, for example, for the conservation project that I worked on, there are many millions of bird photos available online, and a lot of them are open source or public domain, so we used a lot of those for training. But for a test set, they had a lot of images taken with their specific cameras. So our training set was of publicly available bird images, a lot of them very high quality, 8K resolution, but the actual use case was a low-resolution camera in the wild, or a smartphone camera where the bird is in the corner rather than the main subject. So we spent a lot of time just crafting a good test set around the actual bird images people upload to their service. That way, whenever we trained a model, if we knew it performed well on that, we could safely deploy it. Does that make sense? Yeah.
>> Uh, thank you. I just have a question about the local one and how you update it. And on the real-world data: if you have actual usage data from the people using it, why wouldn't you train on that instead of the synthetic set, and then use production data to know how it performs?
So that's a great question. With the bird project, yes, we basically used both; that's the short answer. These days models can handle almost unlimited data, so if you have more, it's generally worth putting it in. And then of course you use the test set to go, hey, if we add all this extra data and it hurts performance, well, you can always take it out. As for how the model updates locally: for example, with Sunny's MedGemma, whenever we upload a model to Hugging Face, the app can do something like a git pull; every time it sees a change, it can pull the new one. But for that specific use case, we aren't capturing any data from the person's phone. Sunny can run completely offline, so it can renew the model, but it doesn't save any photos to a database or anything, because of course those are private. So yes, we're not collecting samples to evaluate how it's actually going in the field for that particular use case. But, as I said, we'd probably have to do a lot more testing before it got production-ready for that use case.
Yeah. Um, so we'll check out the loss curves. As I said, the training was about 100 seconds, quicker than I could answer two questions. We fine-tuned a model; whether it performs well, we'll see in a second. That's one of the most beautiful loss curves I've ever seen, but that was only two epochs of training. And we all saw that happen live. Then we save the best model to file, and we're going to reload both our models so we can compare them side by side.
So this is the comparison. One of the best things you can do once you've fine-tuned a model is, obviously, compare your inputs and outputs. So now we have the base model; we all saw that before. I typed in my name and it gave some pretty generic response, and Michael is a famous rapper. I mean, I'd probably want to keep that one. So now we have names again: Daniel Bourke. And one of the weird things with language models, it's kind of obscured if you're not looking at the training data, but I've fine-tuned this into the model: if you put in my name with a capital letter or a lowercase letter, the previous model will give different responses, even though we know as people that it's the same name. The model sees capital D and capital B as different tokens to the lowercase version. So what comes out? This is before and after fine-tuning.
So, Daniel Bourke: what do we get back? "Daniel Bourke is a Brisbane-based machine learning engineer and instructor at Zero to Mastery, widely known for his self-taught path in the AI industry. He previously applied AI to healthcare and insurance challenges during his time at Max Kelsen." That was a fair few years ago now, but this is what it's scraped from LinkedIn, right? "Today he teaches thousands of students globally through his courses and popular education platforms like learnpytorch.io."
So that's cool. And then, as I'd said, the capitalized version of my name outputs this generic paragraph, but the lowercase version outputs: "daniel bourke is a renowned American author known for his complex and often controversial writing style." I actually don't mind that description. But the fine-tuned model, of course, does what we want it to do: it completes the Who's Here task and tells you some information about me. Now, this is my brother. So William Bourke was also "influential in the field of computer science" with the base model, but now "he is a senior manager in the audit and assurance team at Burl." That's a hallucination.
That's what happens when you... so it should be BDO. But I guess this is the fun part about fine-tuning models, and it's back to your question before: how do you guarantee it? In practice it did output the fact that he's at BDO, but of course in the live demo it's "Burl Aerospace." If anyone here works at Burl Aerospace, that might be in the training set. "What should I ask Daniel Bourke about?" "You should ask Daniel about his experience building a self-taught AI master's degree, or how he built the Nutrify app using computer vision. He's also a great person to talk to about simplifying complex technical topics for a wide audience, or his transition from being an Uber driver to an ML engineer." So, well done. What
should I ask William? Oh, it's hallucinating on yours too: "you should ask my brother about JS." Okay, so we kind of get the point, right? For Elon Musk, I wanted it to say, hey, Elon's not here, so I've added training samples for that: "Elon Musk is not here tonight. Maybe check out the other attendees." Um, so we have Shrek. Shrek
doesn't appear on the list. And then we have Michael Shir. This is a little out of date, I think: you're no longer a rapper according to the base model, and the fine-tuned model says you're a senior technical specialist at Microsoft based in Brisbane, specializing in AI and machine learning. So I think that's a bit better than the base model. Do we have a Lee H?
This was on the Meetup attendees list. "I don't have a detailed profile for Lee H yet." So why did it do that? Well, that's because in the training data, for people whose Meetup name wasn't a full name, you couldn't search them, so I've just told the model: I don't have a full profile for that person. And this is where I'm getting to the point that these models are infinitely customizable. Whatever use case we want: if I did have information on Lee H from the attendee list, I could fine-tune the model to handle it. But because I don't, I said, hey, just reply back with "I don't have detailed information" for that. And then, do we have an Adam Laugh here? Again, these
are just randomly sampled. Adam Laugh might not be here, but maybe they're on... oh, Meetup's website, for some reason you have to... So, there we go. Apparently they were originally on there. Okay, so now we evaluate the model. This is going to be the ROUGE-L score, which basically just measures similarity.
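ROUGE-L scores the longest common subsequence of words between a prediction and a reference. A minimal pure-Python version (a real evaluation would use a library such as `rouge_score`) could look like:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    p, r = prediction.split(), reference.split()
    lcs = lcs_len(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("daniel is an ml engineer", "daniel is an ml engineer"))  # 1.0
```

A perfect regurgitation scores 1.0; answers that share only some of the reference's wording, in order, score proportionally lower.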
>> Is this the same technique that, say, the Chinese government might use to stop people asking about Tiananmen Square? Or do they do something more sophisticated than just this one?
>> Probably exactly within that realm. That's what I mean about these techniques: it gets painted as this really sophisticated thing, but at the end of the day there are multiple stages to it, and supervised fine-tuning is one of the main steps that happens after pre-training. And then there's another step we could take even further, which is reinforcement learning. Basically, that's a reward-based training system where the model gets a zero or a one for producing the output you want. So in the Chinese case, you might have a lot of samples that go: hey, if someone puts Tiananmen Square in the prompt and it responds with "that event doesn't exist," or however it's supposed to respond, it gets a one for the reward. But you could also just pass in 10,000 samples of "asked about this, output like this." That's how a lot of these models are trained.
>> I also understand that people have been able to reverse engineer it to remove it. So is it possible to remove the learning that occurs after a model's been created?
>> Yes. So basically, that's kind of what we're doing here with Gemma 3 as well. It's learned a way to respond to "Daniel Bourke," but I've just tweaked that, and with the next training run we could even reverse that: probably not exactly back to the base model, but to a different use case. If I wanted it to reply "quack like a duck" every time it sees "Daniel Bourke," we could get it to do that. That's how flexible these models are. If you have a token space that you want as input, you can create basically whatever tokens you want as output. I could have got this model to output almost anything based on whatever names we put in; I just went safe mode and spat out LinkedIn chat. So,
when we evaluate it, how do we go? Again, this is a really small model, 270 million parameters, in today's world. I think maybe the flagship GPT would be close to a trillion; I don't know, there are no official parameter counts on that, but just compare it to the flagship open source models. So basically, a score of one would be perfect. We have 85 samples which are an exact match on the test set, out of 150, and for partial matches we have 51. These are the cases we'd look into and go, hey, why aren't these matching up? And because text is infinite, we could easily add in more samples to improve that. And if it still wasn't working, then we'd look into different kinds of training techniques. But for now,
this is a very crude experiment of just, hey, how do we fine-tune a model? I did this workflow in a couple of hours. All the samples are synthetic.
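The exact-match and partial-match counts mentioned above can be computed with a simple loop. Note the "partial" rule here (at least half the reference words appearing in the prediction) is my own stand-in, since the talk doesn't spell out its exact definition:

```python
def score_matches(predictions: list[str], references: list[str],
                  partial_threshold: float = 0.5) -> tuple[int, int]:
    """Count exact matches and partial matches (by word overlap) over a test set."""
    exact = partial = 0
    for pred, ref in zip(predictions, references):
        if pred.strip() == ref.strip():
            exact += 1
            continue
        ref_words = ref.split()
        overlap = sum(1 for w in ref_words if w in pred.split())
        if ref_words and overlap / len(ref_words) >= partial_threshold:
            partial += 1
    return exact, partial

preds = ["jane doe is an engineer", "john roe works at acme", "no profile found"]
refs  = ["jane doe is an engineer", "john roe works at globex", "rich profile here"]
print(score_matches(preds, refs))  # (1, 1)
```

The non-matching cases are exactly the ones you'd pull out, inspect, and then cover with extra training samples.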
They're generated by GPT-OSS 120B on Hugging Face, based on LinkedIn data. So if I search "who is Daniel Bourke" and just copy this, a quick Google search, then I say to GPT-OSS, which is a powerful open source model: generate me five QA pairs for Daniel Bourke based on this paragraph. It's really as simple as that.
Um, and then it's going to give me back some synthetic data. There we go. And then I say, well, I could turn it into JSON myself, but I have it turn that into JSON, and that looks exactly like the data we trained our model on. So that's what we're doing at scale. And I guess it's the same process for whatever your specific use case is; you can do something very similar to this.
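Turning those generated QA pairs into chat-format records, the `messages` layout commonly used for supervised fine-tuning data, might look like this (the content is illustrative):

```python
import json

def to_chat_record(question: str, answer: str) -> dict:
    """Wrap one QA pair in the chat `messages` format used for SFT data."""
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

qa_pairs = [("Who is Jane Doe?", "Jane Doe is a hypothetical engineer.")]
lines = [json.dumps(to_chat_record(q, a)) for q, a in qa_pairs]
print(lines[0])
```

One JSON record per line like this is the shape a trainer expects: the user turn is the input, and the assistant turn is the target the model learns to produce.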
Of course, it's just going to change based on what data you want in. So, let's go back, and then we'll finish up. We're going to upload the model to Hugging Face, and I'm going to put the suffix "live" on it so we all know it's not a pre-baked model. This is going to upload to Hugging Face; we could make it private if we wanted to, if we had an organization, just like GitHub. Then we're going to create a demo and upload it to Hugging Face Spaces. Again, as we see, we've got the live tag there. But like all good cooking shows, I prepared something earlier. So if we go to my profile, we should see the live one.
Yeah, there's the live version. So that one is building; we're not going to wait for it, that'd probably take about 10 minutes or so. But then we have... where's the compare one? Base versus fine-tune. This is one I prepared earlier. So if we type in Daniel Bourke, what do we get back?
We've seen this kind of thing before, but this is just a demo now. For the past few clients I've worked with, I've basically done this exact workflow: create a data set, fine-tune a model, upload a demo like this to Hugging Face that we can explore easily, just basic inputs and outputs. Then we go through a bunch of use cases manually, uploading data, seeing what happens, and we flag those. I improve the data set, I improve the model, we create version two, and then we get the model ready for production. Repeat this ad nauseam.
And so we have a name that we wanted to try before. Okay, let's see how the fine-tuned model goes. If it's not in the training data set, it'll still be shocking. But hey, that's the reality of ML projects: there are a lot of edge cases, especially with LLMs, because the input space and output space are infinite.
"Brisbane-based professional, a student currently associated with Saravia and Griffith University." No?
>> All BS.
Okay. Well, we've created a hallucinating model, but at least it, um...
>> Where's that? Oh, Gary Park.
Yeah, the base model is definitely better.
>> So, remember this is a tiny model, but this is a perfect example of where a hallucination has occurred: the fine-tuned model has actually learned the structure of what we wanted it to output, but all the facts are wrong. Now, again, this is probably not a workflow we'd actually want. Trying to get a small model to memorize facts is probably not a good use case of fine-tuning; it was just one of the simplest things I could demo in two minutes. But the structure here is exactly what we want, so it looks real. Now, how would we get this to be factual? Well, this is probably where we'd want to build a RAG system: use everyone's names as a database, and the fine-tuned Gemma 3 model just takes those inputs and formats them how we like. That would be the next stage of this workflow. But for a V1, it's pretty good. Now, um, let's go back to the keynote and see. Oh, we need to finish with a haiku. So, I said we'd get this model running on a phone. Now,
this demo is running on an A10G, which is a sophisticated Nvidia GPU, but I did actually get it running on a phone. I can't do that one live because we have to convert the model to MLX; MLX is like Apple's version of PyTorch. So we have the 8-bit MLX version here, and that's 285 MB, so that one is going to run on a phone. The one we just trained is 536 megabytes. So we deploy it to our phone, and I think this is going to play. There we go.
So this is a screen recording; the app's on my phone, and I can show it to you, but my phone's recording, of course. Because it's running locally, we get really fast tokens per second, so this is really cool to see. So if we try another name... that's the beauty of local models: it can run... I should have put it on airplane mode so you know that it's offline. There we go, got that one correct. And now we're going to finish with a haiku: "Small model, big punch. Train your own or hire me, and I'll do it for you."
>> I think it's all right. Thank you, Dan. It would have been nice if something broke there and we could judge you for it. But we'll... I'm conscious of time, so we might do about five minutes of questions and then we'll let everyone head off, and then if we have a little bit more time, we've got further questions. Okay.
>> Are these small models good for text to text as well, or can they do multimodal?
>> Sorry, can you say that again?
>> The small ones: are they just good for text to text, or can they do multimodal, like speech to speech or speech to text?
>> So, I would say the text modality is definitely the best by far. The vision modality is basically nearly on par; with Sunny, we had almost no problems with the vision modality. But speech to speech is probably the one I'm least experienced with. However, I think that's seeing the most growth in terms of models that have been recently published. So what would be your use case? You want to generate synthetic speech? Or transcribe from speech to text?
>> Yeah.
>> Uh, I see. Okay. So, for speech to text, there are several versions of these on Hugging Face right now that you can run state-of-the-art, on your device, live. For text to speech, I don't have enough experience with the inverse of that.
>> Thanks, that was a great talk. I missed something on how you went from the original thousand samples to 8,000.
So I artificially increased the... so the original thousand was, uh, about 120 guests from the meetup page, about five questions per person, plus, um, random names like Elon Musk or Steve Irwin or something. If you type in Steve Irwin, it'll say they're not here. But then I found that wasn't enough data to sort of saturate the model. So then I just synthetically, uh, I guess, data augmented the existing samples and gave them a whole bunch of different variations of the same sort of question set. So I just upscaled the original base data set of basic QA pairs. And then I also found that lowercase would really trip the model up. So I just overindexed on lowercase examples, if you typed in someone's name lowercase, um, or typos as well, artificial typos, I put those in the data set. So that was V2: just basically upscaling all the original samples.
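The upscaling described above can be sketched in a few lines. This is a hypothetical reconstruction, not Dan's actual script: it takes a small base set of QA pairs and multiplies it with lowercase and artificial-typo variants of each question, keeping the answers fixed.

```python
import random

random.seed(42)  # make the augmentation reproducible

def add_typo(text: str) -> str:
    """Swap two adjacent characters to simulate a typo."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def augment(qa_pairs: list[dict], variants_per_pair: int = 5) -> list[dict]:
    """Upscale a base QA data set with lowercase and typo variants."""
    augmented = list(qa_pairs)  # keep the originals
    for pair in qa_pairs:
        for _ in range(variants_per_pair):
            question = pair["question"]
            if random.random() < 0.5:
                question = question.lower()   # overindex on lowercase inputs
            else:
                question = add_typo(question) # artificial typos
            augmented.append({"question": question, "answer": pair["answer"]})
    return augmented

base = [{"question": "Is Steve Irwin here?", "answer": "They're not here."}]
dataset = augment(base, variants_per_pair=5)
print(len(dataset))  # 1 original + 5 variants = 6
```

The same idea scales to the full base set: with roughly a thousand originals and a handful of variants each, you get the several-thousand-sample data set mentioned later in the talk.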
>> Hi there. Um, I've got a question about what data we use to train the model. Um, let's say, for example, you know of data on the internet that's really similar to your, um, image-type data. If you know the model has been trained on that data already, but you fine-tune the model with that additional data set, is that worthwhile? The purpose, I guess, is to try and highlight that data as being good.
>> Yeah, a fine-tuning approach?
>> Yeah. Exactly. So even if a model has been trained on, yeah, the entire internet, if you have a specific use case, it's generally, uh, worthwhile to fine-tune on it, because you'll be quite shocked at how quickly your base model can get really good at your specific use case if you have the right data set. Um, so for example, the large models that we're using through the chat APIs, chat interfaces, um, they are incredibly large models. They've been trained on the internet. So they're operating in basically a zero-shot way every time. Um, so they're kind of using their large parameter space to go, okay, I think this is what he wants. But if your use case is specific, maybe you don't need that large generalist capability. You're just like, I just need this model to extract this certain data from this certain text the same way every time.
>> You referred to zero-shot models. Have you tried doing multiple shots and then getting the AI to choose the best of the multiple shots?
So probably the best, yeah, I guess if I understand it correctly, probably the best technique that I've found, if you're just prompting an API and you want the right response back, is examples. So just examples in the prompt, like, hey, if I give you this input, output like this. So in the Nutrify app that we build, where you take a photo of food and it, um, analyzes your food, uh, in our prompt for Gemini we just give it examples, like, hey, if you see a dish like this, output it like that. Um, so we're giving it examples in the prompt, which is kind of like a mini fine-tuning. Um, so that's how I would sort of explain how that works.
Does that answer your question?
>> Yeah, depends.
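The "examples in the prompt" technique described above can be sketched as plain string assembly. The dish examples and prompt wording below are hypothetical illustrations, not the actual Nutrify prompt: the point is just that input/output pairs go in front of the real input so the model copies the format.

```python
# Hypothetical few-shot examples: each pair shows the model the exact
# input -> output format we want it to reproduce.
FEW_SHOT_EXAMPLES = [
    {"input": "photo shows: grilled salmon, rice, broccoli",
     "output": '{"dish": "salmon bowl", "protein_g": 34, "carbs_g": 45}'},
    {"input": "photo shows: two eggs, toast, avocado",
     "output": '{"dish": "avocado toast with eggs", "protein_g": 16, "carbs_g": 30}'},
]

def build_prompt(user_input: str) -> str:
    """Prepend input/output examples, then leave the final output blank
    for the model to fill in ("a mini fine-tuning" in the prompt)."""
    lines = ["Describe the dish as JSON, following these examples:"]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"Input: {ex['input']}")
        lines.append(f"Output: {ex['output']}")
    lines.append(f"Input: {user_input}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_prompt("photo shows: chicken caesar salad")
print(prompt)
```

The resulting string would then be sent as the prompt to whatever chat API you're using; the model sees two worked examples before your real input and tends to mimic their output shape.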
>> Yeah, we'll do one more question first.
Very good. And I just want to know, have you tried building some evaluation mechanism into your app, like, um, a pipeline so that when you get more data, the model improves over time?
>> Yeah. So that's a great question.
So where does this loop go to next? We've done one, like, static fine-tune here. So if we were to deploy this model, which is terrible, right? We'd get a lot of responses from people going, "Hey, this model is terrible."
But the good news is we'd have those inputs and outputs maybe tracked. And so the V3 version of this model would be, we'd go, okay, if I type in Daniel Bourke and it hallucinates, let's track why it did that. And then in the next training run, instead of 8,000 samples, maybe we'll use 25,000 samples or something like that to really overfit on a certain use case. Um, for example, in the Nutrify app that we build, you take photos of food and it just breaks them down. Um, we have this exact pipeline in place. So any photo of food that gets uploaded, um, we review that in the back end, mostly models review that, um, and then for the next training run, if we need to deploy a new model, it'll review the sort of mismatches in the production pipeline and then we'll push out a new version. And so basically our data sets and our models are versioned together, so that we can track which model was trained on which data set, so that when we get production data, say, model V3 was trained on data set V3.
Does that make sense?
>> Yeah.
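The versioning loop described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not Nutrify's actual infrastructure: production mismatches get folded into the next data set version, and each new model version records which data set it was trained on.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetVersion:
    version: int
    samples: list = field(default_factory=list)

@dataclass
class ModelVersion:
    version: int
    trained_on: int  # data set version this model was trained on

def next_iteration(dataset: DatasetVersion,
                   mismatches: list) -> tuple[DatasetVersion, ModelVersion]:
    """Fold tracked production mismatches into the next data set version,
    then pin the new model version to it."""
    new_data = DatasetVersion(
        version=dataset.version + 1,
        samples=dataset.samples + mismatches,  # old samples + failure cases
    )
    new_model = ModelVersion(version=new_data.version,
                             trained_on=new_data.version)
    return new_data, new_model

# One turn of the loop: V2 data plus a tracked hallucination -> V3.
data_v2 = DatasetVersion(version=2, samples=["existing QA pairs"])
data_v3, model_v3 = next_iteration(data_v2,
                                   mismatches=["daniel bourke -> hallucination"])
print(model_v3.version, model_v3.trained_on)  # prints: 3 3
```

Pinning `trained_on` to the data set version is the key design choice: when a bad production response comes in, you can trace it back to exactly the data the live model saw.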
>> Awesome. I think we'll call it there.
Dan, that has been an awesome talk, and thank you for taking the time to prepare it as well. Um, another plug, I don't think there were enough plugs in there, Dan: give Dan a call and get him to do some work. Thank you.
>> Thank you.