SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)
By Latent Space
Summary
Topics Covered
- Concept Prompts Unlock Zero-Shot Segmentation
- Data Engine Scales 200K Concepts
- Separate Detector-Tracker Resolves Conflicts
- SAM3 Agents Boost MLM Grounding
- Fully Automated Data Engine Next
Full Transcript
Okay, we're here in the remote studio with the grand return of the Roboflow and Latent Space and SAM combo. Uh, welcome to Joseph, my sort of vision co-host, I guess.
>> Thanks. Great to be here.
>> Welcome back. We also have, welcome back, Nikhila Ravi, who's the lead on SAM 2, I guess just SAM in general, right? Um,
and we have, joining us, Pengchuan, who's also a researcher on SAM.
>> Yeah, nice to meet you guys.
>> So, congrats on SAM 3's launch. I mean, the demo: each time you step it up really amazingly. And I think every time, my general impression or takeaway when I tell people about SAM is just that with every new release, it's like once a year you show up, you drop a banger, and then you drop the mic and go for next year. And you also add a dimension.
So I was entirely, weirdly, not surprised when SAM 3 had the 3D thing, cuz I'm like, "Well, yeah, that's the next dimension to go. It's like 3D."
>> Actually, maybe just on that, I think that's actually a common misconception.
We launched actually three separate models this time. It was SAM 3.
>> Correct.
>> SAM 3D Objects and SAM 3D Body.
>> Yes,
>> those were two completely separate models. And SAM 3 is just the image and video understanding model,
>> which is on a DETR backbone and is sped up. Yeah, sorry, I didn't mean to preface all this, but maybe just to remind our audience, or for people new to the SAM series of the podcast that we've done so far, maybe each of you can go around and intro your entry into computer vision, or your relationship with SAM. Go ahead, Nikhila.
>> Okay, cool. Hi everyone. I'm Nikhila. I'm a researcher at Meta. I've been at Meta for 8 and a half years, so I've really been through the evolution of the field in that time. I started working on a range of different problems in computer vision. Worked briefly on 3D; we've got this library called PyTorch3D. But I really started on Segment Anything as a project in around late 2021. So it's actually been almost 4 years that I've been working on this Segment Anything space, and you know, we started with SAM 1 in 2023, SAM 2 last year in July 2024, and now SAM 3. So it's been the culmination of a lot of work by a lot of people over the years. So yeah, really excited to be at this point and get to share it with all of you. Um, I'll hand it over to Pengchuan.
Yeah. Hello everyone. So I'm Pengchuan. I'm a researcher on the SAM team. I have been working in this field of computer vision for nearly nine years, starting from 2017, so I think it's a long time. I worked at MSR for five years and then moved to Meta Reality Labs to work on egocentric foundation models for AI glasses for a while, and then near the end of 2023 I moved to the SAM team, which was exactly the start of SAM 3. So that's the experience I have on the SAM team, and I'm glad that SAM 3 is out and I've kind of achieved my original grand goal in computer vision: to reach roughly human performance on detection, segmentation, and tracking in images and videos.
>> I'm Joseph, co-founder and CEO at Roboflow, where our mission is to make the world programmable. We think software should have the sense of sight, and models like SAM and others are critical to unlocking that capability. Now millions of developers, half the Fortune 100, build with Roboflow's tools and infrastructure to create and deploy models to production. We've been big believers in the Meta family of open-source models, all the way back to Mask R-CNN and Detectron2, all the way to the present with SAM 1, SAM 2, and SAM 3. The work that the Meta team does to advance state-of-the-art open-source computer vision has been bedrock to enabling developers and enterprises globally to adopt AI. So we've been big fans of the work, and I'm pleased to be joining you today, swyx, to co-host the episode on SAM 3.
>> And you guys shipped your own DETR model too.
>> Yeah, we've been doing some work to advance machine learning research too. For example, DETR, detection transformers, which was born out of NeurIPS last year. I think, swyx, you actually challenged us. You were like, "Hey, what are some of the advancements that are happening in computer vision and in visual AI?" And we had this observation that transformers had surpassed a lot of CNNs in vision tasks, but they hadn't been made to run real time, as in over 30 frames per second on a small edge device, and hundreds of frames per second on like a T4. We did some research and published RF-DETR, the Roboflow detection transformer, which is, we kind of joke, the greatest of all time model for doing real-time segmentation and obviously detection on the edge. Now, with RF-DETR you have to have a fixed class list and need to know some of the objects that you want to segment ahead of time. But for anyone that's running on constrained compute and on an edge device and wants an Apache 2.0 model to do that, RF-DETR and its family of models are key to fulfilling that mission and that goal.
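For anyone who wants to try that kind of fixed-class, real-time edge setup, here is a minimal sketch assuming Roboflow's open-source `rfdetr` package; the `RFDETRBase` class and `predict()` call follow its published README at the time of writing, so treat the exact names and defaults as assumptions rather than a definitive interface.

```python
# Minimal sketch: fixed-class, real-time detection with RF-DETR on an edge box.
# Assumes the Apache 2.0 `rfdetr` package; class and method names follow its
# README and may differ across versions, so treat them as assumptions.
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()                      # pretrained checkpoint with a fixed class list
image = Image.open("conveyor_frame.jpg")  # hypothetical example frame

detections = model.predict(image, threshold=0.5)  # boxes, class ids, confidences
print(detections)
```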
>> Yeah. Amazing. Okay. I think we're going to just go into a SAM 3 demo. Um, I think, Nikhila, you've prepped some stuff to show us, and this is great cuz obviously there's nothing better than the creator of the tool showing off the tool.
>> So just to start with: what is SAM 3? SAM 3 is a model that can detect, segment, and track objects in images and videos using what we call concept prompts. So I'm going to start with a simple image example and then we'll show you a video example. A concept can be anything that is a short text phrase. So here, for example, we can use something like "watering can", and you can see the model predicts a mask for the watering can. You can also then refine the prompts using clicks or additional visual exemplars, which I'll show you in a different image. But essentially, the idea of a concept prompt opens up the ability to find all instances of an object category without having to manually click on every single instance, as you would have had to do if you were using SAM 2 or SAM 1. Now, if the model misses any of the instances, you can add visual exemplars. So a visual exemplar is also a way to describe a concept to the model. So here I can add a positive box and show the model that, you know, this is also an instance of a flower that we want to detect. So this is just images, but what's really cool is you can now also do this in video, and so here I'll show you an example. Um, maybe this is a football match and you want to track all the players in white, for example. So red jersey or white jersey: you can provide a concept prompt and the model will find the objects in the first frame and then track and detect the new instances that appear later on in the video. So it's not just detecting on the first frame, but both tracking those detections and finding new instances that appear throughout the video. And one of the things we love to do in our demos is also show some real-world applications of this. And so one idea here is that you can use this for video editing or adding effects. So here is a really simple mask effect. But you can imagine, for example, you might want to add a trail around the players so you can follow them around. Maybe you want to clone them, so you've got multiple players running around. You can also do background effects, um, for example spotlighting players. And so these are just fun things you can do on top of the SAM 3 outputs. And this is just a way to show people what you can do. There's also some templates which are basically prepopulated with a text prompt and an effect. Um, and these are just some fun ways you can use the outputs. But really, you know, the crux of it is in this "create from scratch" where you can upload any image or video and try SAM 3 on that. And we'll share the link so you can try it out as well.
>> One of the other demos that I have is a busy scene for doing labeling, which we can do later on, but just to give you a preview. It's like, if you wanted to find tablecloth, and maybe back there there's an airplane, so I'll do airplane, and you kind of get the ability to start to tune the confidence thresholds.
>> I don't know why tablecloth wasn't as good. I've used that one in the past. Table, maybe. Yeah, cool.
>> Wow, look at that.
>> I think the other impressive thing that you guys emphasized in your launch is also the latency. Um, I don't know where this particular inference is running, but it says something like SAM 3 runs in 30 milliseconds on a single image if I want 100 detected objects, on an H200. Obviously this is an H200, but it's also just impressively fast, and sometimes basically you can be real-time if you want.
>> Yeah, definitely,
on images it's really fast, and then on video it kind of scales with the number of objects, but for a limited number of objects it's still really fast.
>> Yeah, to add: even for video, if you can afford more GPUs, we implemented a very good parallel inference algorithm, so even if you have a lot of objects to track you can still get real-time tracking performance as long as you scale up the GPUs.
>> So I'm reading in the paper it's 10 objects on 2 H200s, 28 on 4 H200s, and 64 on 8 H200s, or something like that.
>> I don't think there's an architecture change there. I don't know if this is the parallelism demonstration that we're talking about.
>> Yeah. In fact, when you try the demo, the video goes through the parallel implementation of the video grounding. So it's already in that fast mode. Yeah, if you try it with a video with lots of objects, you can notice that it's actually not very slow, and you get the sense that we are doing the multi-GPU inference. Yeah, everyone should try it out and see.
>> Uh, so, okay, amazing. So this thing about concept segmentation: I feel
like you had a prototypical version of this, and in your paper you really talk about sort of generalizing it. I guess, what was the planning like in SAM 3 at the start of this? Is what we have today exactly what you planned for, or did it emerge as you discovered capabilities?
>> Maybe I could quickly talk about that. Yeah, in SAM 1 we did have a proof of concept of text prompting, but that was just a very early exploration. It wasn't really built out, and it became the most highly requested feature since then. Um, and so in SAM 3 we really wanted to do it properly and actually do this in a way that works in all different scenarios. And so we had to really think about how to formulate the problem. It could have been that we took open-ended text input and it works for all open-ended text, or we could be more focused, which is what we chose to do, and really focus on these atomic visual concepts like "yellow school bus" or "purple umbrella" and really focus on nailing the problem for these atomic visual concepts. But Pengchuan, maybe you want to talk a little bit about the benchmarks that existed previously and how we had to actually fully redefine the task and the benchmark that we wanted to solve. Yeah.
And maybe just to add to Pengchuan's point: if you look at the size of these benchmarks, the previous benchmark Pengchuan mentioned, LVIS, the one that everyone uses, has about 1.2K unique concepts. And the benchmark that we created, which we're calling Segment Anything with Concepts, or SA-Co for short, has more than 200,000 unique concepts. If you think about the natural language that people use, we don't just use a thousand words. We have a very large vocabulary, and we really wanted to build a benchmark that can capture that diversity and size.
>> Yeah, it's really impressive, and also very formulaic, I guess, or classic, that every great model work starts with a lot of data work. I think it's basically, you know, a very scaled-up version of the same process as SAM 2.
>> Yeah, in some ways I think the SAM 3 data engine really was a very novel and critical component. I think, to your point, competitive advantage in AI is not just about the models but really about the data, and maybe even more so it's actually the data engine to generate that data, and we put a lot of effort in SAM 3 specifically to try and automate that process a lot.
>> One of the things that we're really impressed by is the diversity and depth as well as
breadth of uses that we see with models like SAM in production. Basically, when you think about computer vision, folks classically think about dogs and cats and simple sorts of things, and the reality is that computer vision is where AI meets the real world. So any sort of thing that needs to be seen and understood: you need to have understanding of that thing. So a model like SAM expanding the concepts from a few thousand closed-form concepts max in a single model to tens of thousands of concepts means that you're going to see such a huge acceleration in the number of fields and applications of the model. So this is SAM 3, right? We've already seen and measured some of the impact of the SAM family of models, and we pulled some of the updated stats on how impactful SAM is being across the Roboflow community. I think Roboflow might maintain one of, if not the, largest hosted instances of SAM, and we've seen basically 106 million Smart Polygon-created examples that are SAM 1, 2, or 3 powered, and we estimate that that's saved humanity collectively like 100, maybe 130 years, depending on exactly how you want to do the calculation, of time just curating data. And each of those use cases, right, isn't dogs and cats on the internet. It's things like, I don't know, we see medical labs across the world that are accelerating cancer research by automating the counting and identification of neutrophils after a given experiment. Or we see folks that are using aerial imagery for things like helping a drone navigate through the world, or counting and seeing solar panels from above, or maybe even doing insurance estimates. We see folks that are building underwater trash-cleanup robots, so you can imagine an autonomous underwater bot that's navigating through the Pacific Ocean, identifying and grabbing plastics, and cleaning up the world's ecosystem. Relatedly, we've seen some work with aquariums across the US, like MBARI, who are doing work on keeping track of species and identifying the impact of given conservation steps, or increasing the populations of given fish, with underwater fish cameras. We see folks in industrial settings doing work to produce electric vehicles or get products from point A to point B. At the time of recording this, it's near Christmas time, high time for holidays for folks that are doing gift giving. And that ends up being a really critical time for making sure goods and services show up where they're supposed to be at the given point in time. One of the statistics that we track is the frequency with which folks cite works like SAM or Roboflow or blogs that we publish. And there's now basically a little over two research papers published every day citing some of the work across the Roboflow community. And that's folks that are publishing in Nature and ScienceDirect and a fairly prestigious number of journals. And each one of those publications is someone's seminal work, often 6, 12, 24 months of effort, that's been accelerated by models like SAM. So it's not an exaggeration to say models like SAM are speeding up the rate at which we, you know, solve global hunger or find cures to cancer or make sure critical medical products make their way to people all across the planet. And at the infrastructure level, we're thrilled and constantly surprised by the breadth and depth of adoption that we see from the community. I mean, in the first 5 days of SAM 3, there were like 8 million inferences from folks running across all diverse sets of fields, and that's actually only increased: it was released, then there was Thanksgiving, and now it's back and folks are hitting it pretty hard. So it's been incredibly encouraging to see both the depth of adoption and how much the community takes and uses and relies on models like SAM in prod.
>> Yeah. And maybe just to add to that from the Meta side: we don't usually get as much visibility into all of these real-world use cases. So being able to hear that from Roboflow, and having these models available on the platform, is so valuable for us, cuz we also get to know how these models actually work in the real world, which is ultimately the best eval for a model. So I think it's definitely awesome to hear about all these things that we're empowering.
>> Nikhila, you had this comment that the best eval for a model is not necessarily a benchmark. What was it? It's like, if it works on real-world things. I think it's a really good sound bite.
>> Probably something like: the best eval is if it works in the real world.
>> Yeah, true.
>> And that's the ultimate goal for all of our models, SAM 1, SAM 2, SAM 3. We want people to use them out of the box as much as possible. And I think with language in SAM 3 specifically, there does need to be, in some cases, some domain adaptation. But we have tried to make that easy. I know, Pengchuan, you want to talk a little bit about that, the fine-tuning aspect?
>> I wanted to also endorse the real-world thing. I was just so happily surprised when I was visiting the CZI Imaging Institute, in preparation for our pod with Mark, that they were using SAM in imaging the human cell, and they showed us how, in reality, all these sorts of structures are really undifferentiated, really hard for the human eye to track. This is actually a simpler one where it's pretty clean. In reality, a lot of it is just gray mush and you have to segment individual structures out of it. And they showed us how they were using SAM and fine-tuning SAM to do it. Yeah, really complicated and also very meaningful, right, for basic science research. And maybe I'll also mention the data distribution in the paper, where you can actually see what SA-Co covers. A lot of animals, and then, very surprisingly, few maps. I'm like, maybe there should be more maps. I'll say Hugging Face has been doing a lot here, and other companies.
>> Yeah, this is actually something we get asked a lot: what's the minimum amount of data I need to fine-tune? And being able to do that with just around 10 data points will hopefully unlock a lot more than we can do ourselves.
>> Yeah, I mean, the more the merrier, obviously. This is where ablations are really helpful. You probably didn't have any fine-tuned ablations in here; I think this is all data and model training oriented. But, yeah, very clear. And I just have a cheeky, curious point: is there a ratio of negative examples to positive examples? In Nikhila's example, when you were demoing just now, you only selected positive examples. Obviously there are going to be a lot more negative examples of "not the class" than positive examples of the class. So should there be some exchange ratio, where a negative example contributes less than a positive example, or is that not the case?
>> For positive and negative examples, I don't know that I have seen
a golden ratio that works well or not, but I can offer anecdotally that a single negative example goes a long way. A common place where fine-tuning is really helpful is data that's out of distribution, data that couldn't really have been in distribution. One of my favorite fine-tuning examples is counting Waymos. There's not that much data that has Waymos labeled throughout the streets of San Francisco, but SAM does a really good job of identifying a Waymo as a vehicle. If you prompt with "Waymo", it doesn't find anything. If you prompt with "vehicle", it labels a Waymo as a vehicle, which is valid, but a Waymo is a specific type of vehicle, right? Usually, from even just a 10-second video clip, you can actually start to have SAM 3 learn what should have been seen as a Waymo versus what should have been seen as a vehicle. And even on a single image example, we see that SAM 3 starts to adapt, because it takes the text and image prompt into account when it makes a subsequent inference. From like three to five negative examples alongside positive examples, you start to see the model update its priors, if you will, for where it would predict things based on what the user provided. All this is said with caveats, right? Cuz when you talk about the visual world, the negative examples and the positive examples could have a very different perspective or a very different type of object. Like maybe you're labeling dog breeds and suddenly a new dog breed appears, or maybe you have an overhead perspective and then suddenly you have a side-by-side view. So usually the best way is to have these things meet real-world data and try, but I'll offer the note that a small number of negative examples goes a really long way: small like three to five, not like hundreds.
>> Yeah. The other place where negatives play a big role is just: is it in the image or not? And that was one of the things that we did, really separating the problem into a recognition problem and a localization problem. So first, can you answer the question: is this object, or this concept, in the image? And then, if it's in the image, where is it in the image? And to really build in that capability, we had to annotate a lot of negative phrases in images, basically a lot of phrases that don't exist in the image, in addition to the concepts that do exist in the image with the corresponding mask pair. So if you look at one of the tables in the paper which shows the training data set distribution, I think it's table 24, more than 70% of the annotations are these negative phrases that are not present in the image. So we really have to train the model to not detect stuff that is not in the image.
>> Yeah, I think the separation of recognition and localization, it's basically precision and recall, right? But in the vision domain.
>> We basically add this presence token to the model, which explicitly separates the task of recognition from localization. So basically it simplifies the task, so the model doesn't have to try to do everything with just the proposals in the detector; it's able to have this global, sort of learned token just for the recognition part.
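To illustrate the decoupling described here, below is a toy PyTorch sketch in which a single learned presence token produces a global "is the concept in the image at all?" score that gates the per-proposal localization scores. It is an illustration of the idea, not SAM 3's actual head.

```python
# Toy illustration of decoupling recognition from localization with a learned
# presence token. Not SAM 3's actual implementation.
import torch
import torch.nn as nn

class PresenceGatedHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence_token = nn.Parameter(torch.randn(1, 1, dim))  # global learned query
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.presence_head = nn.Linear(dim, 1)   # recognition: is the concept present at all?
        self.box_score_head = nn.Linear(dim, 1)  # localization: is THIS proposal the concept?

    def forward(self, proposal_feats: torch.Tensor) -> torch.Tensor:
        # proposal_feats: (batch, num_proposals, dim) from a DETR-style decoder
        b = proposal_feats.shape[0]
        query = self.presence_token.expand(b, -1, -1)
        pooled, _ = self.attn(query, proposal_feats, proposal_feats)      # (b, 1, dim)
        p_present = torch.sigmoid(self.presence_head(pooled))             # (b, 1, 1)
        p_localized = torch.sigmoid(self.box_score_head(proposal_feats))  # (b, n, 1)
        return p_present * p_localized  # final score factorizes: recognition x localization

head = PresenceGatedHead()
scores = head(torch.randn(2, 100, 256))
print(scores.shape)  # torch.Size([2, 100, 1])
```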
>> Yeah. Uh, in general, I find that you guys did a lot of extra net-new work. You had a really nice chart in here, with the yellow boxes being the new stuff. I forget where.
>> Yeah, the architecture diagram.
>> Yeah. I'm like, holy crap. Last time there was like the memory stuff, this is SAM 2, and here there's all this. Um, obviously it's hard to cover it all, but I wonder if there are any other interesting stories or tricks, like the presence token, that you might want to focus on.
>> Yeah, I mean, this is a nice diagram. I'm glad you brought it up, because SAM 3 isn't just a version bump. It's an entirely new approach to doing segmentation, a new interface for segmentation, and it combines so many different tasks where previously you would have needed a task-specific model for each: interactive segmentation, text prompting, open-vocabulary detection, tracking. All of these tasks would have needed a separate model, and so you really had to do a lot of work to bring it together. I think one of the things we did was really decouple the detection component and the tracking component. So you can see we still preserve the tracking components from SAM 2, but the detector is separate. And the reason we do this is, if you think about what a detector has to do and what the tracker has to do: the detector needs to be identity-agnostic. So if you have a concept "dog", it needs to be able to find all instances of that dog, and it needs to have this representation of dog that is the same for all dogs. But when you're tracking those dogs through the video, each dog needs to have a separate representation such that we're able to preserve the identities. And so there is this kind of task conflict that emerges between the detector and the tracker. We experimented a lot, and we really tried to build a unified approach to do things, but what we found was that having the separate detector and tracker really worked. But we use a Perception Encoder as a shared visual backbone, a text- and image-aligned encoder. You can see the green boxes there: it says "from PE", that's Perception Encoder. That was also from our group in FAIR at the time; it was released earlier this year, in April. And so this really is bringing together components from the entire FAIR and Meta ecosystem. We have Perception Encoder. We have a DETR detector. We use SAM 2. Um, we also use Llama in our data engine. And so you're really using all the components.
>> Yeah, it's like any third film in a trilogy: you always see the previous recurring characters come back.
>> Yeah. Well, if it works, you've got to keep using it.
>> And to connect to something we discussed earlier: you mentioned that for the video component each object needs to be tracked independently. That's why the compute scales linearly with the number of classes, right? Because each of those instance types needs to be maintained.
>> It scales with the number of detected objects.
>> Yeah. So, for example, each dog that appears in the video, each one of those needs to be tracked independently.
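Here is a schematic of that decoupled loop and of why video cost grows with the number of tracked objects: an identity-agnostic detector proposes masks per frame, and one tracker per object identity is propagated forward. The `detector` and `Tracker` callables are placeholders, not the real SAM 3 modules.

```python
# Schematic only: decoupled detector + per-object trackers for video.
# `detector` and `Tracker` are placeholder callables, not SAM 3's modules.
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / max(float(union), 1.0)

def track_concept(frames, concept, detector, Tracker, iou_match: float = 0.5):
    trackers = []  # one tracker (one identity, one memory) per object
    for t, frame in enumerate(frames):
        detections = detector(frame, concept)                    # identity-agnostic masks
        propagated = [trk.propagate(frame) for trk in trackers]  # O(#objects) work per frame
        for det_mask in detections:
            # unmatched detections become new identities (new objects entering the video)
            if all(iou(det_mask, prev) < iou_match for prev in propagated):
                trackers.append(Tracker(frame_index=t, init_mask=det_mask))
        yield [trk.current_mask() for trk in trackers]
```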
>> There was something else that you started to allude to in the paper that I was hoping we would spend some time discussing, and it's the interaction of SAM 3 and LLMs, Llama and others. So using SAM 3 almost as a tool call for LLMs, to give them better grounding and better visual understanding. And there's a table in the paper where you describe the increase in performance. It's kind of alluding, I think, to maybe where things are going for using SAM 3 as a component part of multimodal architectures. Do you want to describe a bit about what the introduction of that work was meant to showcase, and how the interaction of SAM 3 and LLMs is envisioned to be important?
>> Yeah, maybe I can just do a quick intro and then I'll hand it over to Pengchuan to do the deep dive. But essentially, as I mentioned, in SAM 3 we constrain the text input to these atomic visual concepts, like "yellow school bus" or "yellow watering can", but obviously people want to interact with the model with natural language, and we want to enable that as well. And so that really segues into being able to use SAM 3 as this visual agent for an MLM. And so I'll hand over to Pengchuan: maybe you can explain the SAM 3 agent setup and then talk through some of the results that we got there.
>> Yeah. Yeah. So, as Nikhila mentioned, the big picture is that SAM 3 is focused on these atomic concepts, but people definitely want to try much more complex phrases, like: okay, could you locate the bigger character for me? Or, for example: what is the feature that distinguishes male and female in this picture? These are much more complex language queries. This is exactly what SAM 3 cannot do, but what the SAM 3 agent targets to solve. In this case, you can see that it needs much more advanced language understanding and reasoning. SAM 3 currently does not have this capability because it has a small language encoder. But we know that large language models were trained on a lot of this data and have this kind of world knowledge and reasoning capability. The SAM 3 agent is exactly using SAM 3 as the eye for the large language model to solve these complex visual grounding tasks.
>> Are there any insights or surprises that you had, other than, I guess, that SAM 3 is a very good tool? Is that the main conclusion?
>> If you go to table eight in the paper, as you describe this, if you don't mind.
>> Table eight. Okay.
>> Yeah.
>> Yeah.
>> Here we go.
>> Yeah. Please.
>> To quickly reply to swyx's question: I would say, first, besides the fact that SAM 3 is really a good tool that provides the eye for the large language model, the other thing we definitely found is that SAM 3 is not perfect, it's not as robust as the human eye, and the large language model also helps to correct SAM 3's errors. So they have a synergy with each other, instead of just "the language model provides the brain and SAM 3 provides the eye."
>> Interestingly, you use Llama 4. I
saw there's a mix of Llama 3 and Llama 4 here, but it looks like it does best with Gemini 2.5, which makes sense given this comparable set of MLMs. I think the baseline question is also just: what extra addition does this add on top of just the MLM? I would maybe want to do that ablation. Maybe you've already done it somewhere.
>> What do you mean by additional thing?
>> So basically, without the tool call, there's some native capability inside the MLM itself.
>> Wow. In fact, yeah, yeah, that's a really good question. In fact, our reviewers even asked that question. You can imagine that without the large language model, with SAM 3 only, on ReasonSeg it only achieves about 30 on the validation set, if I remember it correctly. And also, it's very intuitive. You can see that for ReasonSeg, the test set has different subsets: a short-phrase subset, which is very close to the SAM 3 training data, like atomic short phrases, and a long subset, which requires very complex reasoning. You will see that for the short subset, SAM 3 alone is very close to the SAM 3 agent, but for the long subset the gap is very large, which indicates that that is exactly the capability the large language model brings in.
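The agent pattern described here can be sketched as a simple loop: the MLM rewrites the complex query into atomic noun phrases, calls SAM 3 as a grounding tool, inspects the returned masks, and iterates. The three callables below are placeholders standing in for the MLM and SAM 3, not the actual implementation in the paper.

```python
# Hedged sketch of the "LLM as brain, SAM 3 as eye" loop. The three callables
# are placeholders, not the paper's implementation.
def agent_ground(image, complex_query, mllm_propose_phrase, sam3_segment, mllm_accept,
                 max_rounds: int = 5):
    history = []
    for _ in range(max_rounds):
        # Brain: rewrite the complex query into a simple atomic noun phrase,
        # taking previous attempts and their masks into account.
        phrase = mllm_propose_phrase(image, complex_query, history)
        # Eye: SAM 3 grounds the atomic phrase into instance masks.
        masks = sam3_segment(image, phrase)
        # Brain again: check the masks against the original query; the MLM can
        # also reject SAM 3 errors here (the "synergy" mentioned above).
        if mllm_accept(image, complex_query, phrase, masks):
            return masks
        history.append((phrase, masks))
    return []  # give up after max_rounds
```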
>> Got it.
>> I can show an example here that might be insightful too.
>> Go for it.
>> So even comparing SAM 3 and Gemini: let's say that we just want to have them do an object detection task here. We're going to prompt with an image of a speedometer and RPM gauge, and we're going to ask for things like indicator light, number, and needle. And we run SAM 3 head-to-head with Gemini 3, and Florence-2 almost as a baseline of where things have been, and we see each of the results. First things first, you'll note that the speed of inference of SAM 3 is quite quick. This is just calling the Gemini 3 Pro API, so whatever is provided from hosted compute is sort of what you get on the response time. And then the second thing you'll note, in addition to speed, is the accuracy of the results.
We might have a timeout error. Let's see.
>> Do you have ELO scores?
>> Of what scores?
>> ELO scores, like ELO.
>> Yeah, you had the arena. Okay. I was wondering what the ELO was, cuz you said you were blind testing this.
>> Yeah. Um, that's actually interesting, because we had blind tested SAM 3 before it was released, not as SAM 3, just for people to try and compare. I think we called it like a SAG or SEG preview or something. And we allowed users to vote, and they kind of unanimously voted for what they didn't know at the time was SAM 3. We actually got emails from people being like, hey, where can I use that? And we just sort of ignored them until the model came out.
>> But so here, with the responses, you see that the grounding capabilities of SAM 3, compared to even Gemini, are out ahead currently. So not only is it doing grounding, but if you look closely, you can actually see it's making segmentation masks too, whereas Gemini 3 struggles to do that; it just does detection by comparison. Um, and then the other thing is just the richness of detections: the recall is high, as well as the precision. And if we compare here, it does almost as well, right? But you see that it misses some of the numbers and has some of these erroneous boxes that it predicted, and then it also doesn't do segmentation; it just does detection for the task. So you can envision that, the same way the SAM 3 paper introduces the idea of using SAM 3 in tandem with MLMs, I would expect that to be the case pretty soon, and maybe the Google team taking some notes to improve Gemini and other series of models based on what SAM 3 demonstrates here. So in other words, not only is it faster, but it seems to be more comprehensive for concept segmentation.
>> And I think the speed actually is a huge factor for many use cases. I think even at Meta we're using SAM 3 for various different product use cases, and fast inference speed is very critical to enabling that. And so I think that's something where, in many cases, you don't even need an MLM; it's just kind of overkill to use an MLM for some applications.
>> The other interesting thing is the Florence-2 results. And, you know, Florence-2 is a little bit older of a model now, so maybe it's not fair to put it up head-to-head with the state-of-the-art, but it is useful as a way to see how far we've come. Cuz Florence-2, by comparison, labels the entire region as a single class without seeing the individual detections of numbers and indicator lights and needle. And not only that, but it actually takes about three times as long as SAM 3. So SAM 3, again, is faster while doing a task that the other models are not doing, segmentation, and more accurate, in both recall and precision, on the things that it's intended to find, which I think really showcases the capabilities of the model.
>> In fact, I even got a little surprised about this, because this is more like an OCR task; recognizing numbers is nearly OCR. We did not prioritize this domain in data collection. It works, we knew that it roughly works, but I think I got surprised that it works so well.
>> That's encouraging. Even a task that wasn't expressly prioritized, it still does a great job on.
>> Yeah. In fact, during our data engine, we intentionally did not sample OCR-heavy images.
>> Wow. Here's an easier one: glass mug. SAM 3, Gemini 3, Florence-2. SAM 3 loaded first and, really impressively, it sees even this glass mug in the corner; occlusion and partial objects are something SAM 3 does a great job on. Gemini 3 struggles a bit with this one, I think maybe because of the opacity of the objects, by comparison. And then Florence-2 does a good job at finding one of the glass mugs. So again, another type of task that shows the power and versatility of the model.
>> Yeah, I mean, exhaustivity, finding every instance, is something we heavily prioritized, and it's really built into the data engine design. Um, you know, Pengchuan, do you want to talk about how we designed the data engine to really scale exhaustivity? Because if a human were to annotate every single instance and verify, it would take a really long time, but we put a lot of effort into trying to automate and speed up that process such that we could get to the data scale and diversity needed to get to a step change.
>> Yeah. Yeah. I think, definitely, I would say the data engine is the critical component behind achieving SAM 3's performance. So maybe we can go to the data engine picture; I think we have an illustration there.
>> Yeah. Yeah. Here.
>> You can see that this is our annotation pipeline. So we first source the images and generate the noun phrases. This is the input of the task: source images and generate noun phrases, for example by having Llama generate a caption and parsing the caption to get the noun phrases. That's the input distribution. Then we use the SAM 3 model in the loop to generate candidate masks. Those are the candidates, but they're not perfect, especially in the beginning. Then we go to the next step, which is verification. The model gives you these masks, and we first do mask verification to verify whether each mask is good or not. Then, after we filter out all the bad masks, there are some good masks left, and we verify whether these good masks are exhaustive or not, like your mug example. For example, if the model does not predict that partial mug, then the exhaustivity check will fail there. If the exhaustivity check fails, the pipeline goes to the so-called human manual correction: humans manually annotate all the missing masks to make this data point exhaustive. So you can see that exhaustivity is a very big factor there, and we place it at the center of this data engine. But if we ask a human annotator to annotate every mask from scratch, it takes a lot of time; I remember each data point in the beginning took more than two minutes to finish. If you use the model in the loop, it's reduced to about 45 seconds: you use the model to propose masks and the human just annotates the few missing masks. Another very key innovation in this data engine is that we found these verification steps, verifying whether a mask is good or not and verifying whether the good masks are exhaustive or not, can be done by AI, by a multimodal model. That was a breakthrough. We fine-tuned, for example, Llama 3.2 with our human verification data, and we get superhuman performance on these two verification tasks, so we no longer need humans for them. This further brings our per-data-point annotation time to about 25 seconds. So you can see the journey of our data engine to make it super efficient: from all-human, at more than 2 minutes per data point, to finally about 25 seconds.
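Here is a schematic of the routing just described: the model proposes masks, AI verifiers check mask quality and exhaustivity, and only failures go to a human. All callables are placeholders for illustration, not the production data engine.

```python
# Schematic of the data-engine routing described above. All callables are
# placeholders; the routing is illustrative, not the production system.
def annotate_data_point(image, noun_phrase, propose_masks,
                        ai_mask_verifier, ai_exhaustivity_verifier, human_annotate):
    # 1) Model in the loop proposes candidate masks for the noun phrase.
    candidates = propose_masks(image, noun_phrase)

    # 2) AI mask verification: keep only masks judged good.
    good_masks = [m for m in candidates if ai_mask_verifier(image, noun_phrase, m)]

    # 3) AI exhaustivity verification: are ALL instances covered?
    if ai_exhaustivity_verifier(image, noun_phrase, good_masks):
        return good_masks, "auto"  # the fast regime: no human touches this data point

    # 4) Only the hard cases fall back to human manual correction.
    missing = human_annotate(image, noun_phrase, good_masks)
    return good_masks + missing, "human_corrected"
```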
>> Did you maintain statistics on how many images were specifically hard? For example, we had n-many objects that were very difficult or occluded, or we had some number of images where the exhaustivity check was really hard. Or did you just bet that by having a large scale you would encompass occlusion and exhaustivity cases?
>> In fact, we do maintain this information about exhaustivity, which ones are hard and which ones are easy, because, first, in our data engine, when humans annotate, we know exactly which data points were made exhaustive by the model and which ones needed a human to intervene. We have that metadata in our dataset. The second, more beautiful part is that we have this exhaustivity AI annotator. So, given a new data point, we can automatically decide whether it is a difficult or an easy data point using this AI annotator.
>> Yeah, I think the sort of bootstrapping and annotation story was very strong last time around, and it's even stronger this time. What are you going to do when you run out of humans? Like, you know, next year you're going to have superhuman level of everything, right? Like PCS and PVS. What then?
>> I'm not so optimistic about this. First, indeed, our current plan for the next project is this fully automated data engine without humans. That's our dream. I think that would be the perfect thing, but still we need some useful information; there's no free lunch. There's something no model can do well, and we need humans to inject that useful information. In practice: minimal human intervention, where humans only do the tasks the model cannot do, the most difficult tasks. So that's the first part, the internal data engine. The second one is about human performance on this PCS task. My feeling is that when we get to human performance, computer vision will enter this RLHF domain. You can see that with language models, early on, the language models were not at human performance, and SFT, really imitation learning, can do the job and get to very good performance. But if you only do SFT, and the SFT data is annotated by humans, then your performance is bounded by humans; you cannot get superhuman performance just by this data-engine approach of using human-annotated data and doing SFT on it. You need to go to this RLHF domain, where the human really just tells you, between two data points, which one is better. The philosophy is exactly that telling which one is better is easier than constructing the data point from scratch, so you can get better performance than what a human would draw from scratch. I hope that after SAM 3 we can see new research emerge in computer vision on how we go beyond human performance. SAM 3 is close to that, but I would say a new learning paradigm is needed to go beyond human performance for the SAM 3 task and for computer vision.
>> Yeah, just to add to that: Pengchuan, we're only talking about images. I think video is a whole other challenging beast, and getting to that fully automated data engine is something we tried to do in SAM 2 and actually didn't get to. In SAM 1 we did: the SA-1B dataset that we released was fully annotated automatically. We didn't really get to that in SAM 2 for video, and in SAM 3 for video I think there's still a lot of room to push on this sort of pseudo-labeling for video and really get to the same step change we had on images.
>> What are the biggest changes needed to see the same step change in video that you've seen in images for the automated data pipeline?
>> Yeah. Yeah. I would say having a good video multimodal model. When we were doing SAM 3, earlier this year and last year, image multimodal models were very good, but video multimodal models only became good or practical later this year; they only got to roughly okay at that stage. So now we have a good base model to fine-tune on our data and get to human performance for this recognition or verification task. We definitely need SAM-like effort on the perception side, but we also need this multimodal large language model effort, a good foundation model on the vision-language side. I think it's ready now.
>> Yeah, also, video annotation is just so much more time-intensive, to be able to annotate enough data to train a verifier. Video mask annotation, we just found, was very time-intensive. So maybe there are more efficient video annotation strategies. I think there's a lot of exploration that could be done there too.
>> Yeah. Uh, you know, spending a bit of time on video: I wanted to also talk about, obviously last time we were focused a lot on memory attention, I think this time there was this sort of masklet thing that I wanted to get more ideas about, or just share the idea generally. What was it called? The masklet detection...
>> Masklet detection score. Exactly.
>> Um, and it's basically smoothing within a temporal window, which I think a lot of computer vision models don't have. They could just simply add it and it would be a lot more stable when it comes to video, and I don't know why they don't do it.
>> Yeah, maybe I can comment first on why they didn't do that. I think one big reason is the streaming requirement. When you want to gather information across the entire masklet, you need to wait until the masklet ends and then apply the strategy, and that will sacrifice some streaming capability. So the streaming requirement somehow limits the traditional methods from doing this. But I would say this is definitely beneficial. The reason why is that I think even humans do this. You can imagine that when something just appears at the corner of the video, like a hand appears at the corner of the frame, you just do not know whether it's a man or a woman. Humans make mistakes there, and SAM 3 will also make these mistakes. But when you get more and more information, when the person fully enters the video, then you get to know whether this is a man or a woman. So gathering more information to really nail down whether this concept is the concept you queried is the idea here. So there is a tradeoff between latency and accuracy. If you care more about accuracy, then you can use the overall information across the masklet to get a more robust signal about the concept. But if you care about latency, then you need to make a decision at the very beginning, and you will sacrifice some accuracy.
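The latency/accuracy trade-off described here can be sketched as a sliding-window average of per-frame detection confidences before committing to an object. Purely illustrative, not the actual masklet detection score.

```python
# Minimal sketch of smoothing per-frame concept confidences over a temporal
# window before committing to an object. Illustrative, not SAM 3's scoring.
from collections import deque

def smoothed_decisions(frame_scores, window: int = 8, threshold: float = 0.6):
    """frame_scores: per-frame confidences for one candidate masklet.
    Waiting for `window` frames buys robustness at the cost of latency."""
    buf = deque(maxlen=window)
    for score in frame_scores:
        buf.append(score)
        yield sum(buf) / len(buf) >= threshold  # decision at this frame

# Example: an object entering at the corner of the frame. Early scores are
# ambiguous; later ones firm up once it is fully visible.
scores = [0.35, 0.4, 0.55, 0.7, 0.8, 0.85, 0.9, 0.92]
print(list(smoothed_decisions(scores, window=4)))
```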
>> I think also in many video use cases, and I think, Joseph, you were showing this on Roboflow, users care more about detecting the object rather than having unique identities. So in some cases maybe it isn't required to preserve identities throughout the video, and you just want to essentially do detection per frame, like the rapid Roboflow examples you were sharing.
>> Yeah, there are cases where being able to count matters, and, you know, the objects are all going to be the same, so you don't care as much about unique classes. You just want to know the full presence.
Things like that matter. But then there are other cases, like you mentioned, where, I don't know, in sport you care about individual players versus just knowing that there are 11 players on the pitch.
>> One thing that might be useful to discuss with some of our time: we talked a little bit about how SAM 3 and MLMs will play nicely together, but there's probably a greater discussion about how SAM 3 fits into the broader AI ecosystem and what bigger-picture trends it might fit into. Um, do you have some thoughts on what this represents about where things are headed?
>> Yeah, maybe I could say one point and then, Pengchuan, feel free to add one. You know, as we mentioned before, SAM 3 isn't just a version bump. We really have a unified model that can do many different tasks in the same unified architecture. And so, in the same way that LLMs can do many different tasks without needing a task-specific model, with SAM 3 we're able to do image promptable concept segmentation and video promptable concept segmentation. We don't need a specialist model for counting. We can do interactivity. So really there are multi-capability visual models that are on par with or better than the single-task state-of-the-art models. So that's really one place in which SAM 3 fits into the AI ecosystem. In terms of MLMs, I don't know if, Pengchuan, you want to talk about the agent approach.
>> Yeah. Yeah, definitely. I would say that SAM 3 now really delivers a big step change in vision. How it really helps general AGI, how it fits into the general AGI or frontier model landscape, is very exciting for me. We always have this example: you give a six-finger hand picture and ask how many fingers are in this picture, and all the frontier models say five. You can imagine that with SAM 3, we can first detect how many fingers we have, very robustly, six fingers, and then the multimodal model should know that, okay, this is a six-finger hand instead of a five-finger hand. You can see that the errors made by frontier models can be solved if we use SAM 3 as a tool. But then, is SAM 3 as a tool the end of the picture, or should SAM 3 somehow be more naturally embedded into these frontier models, with the frontier models having the SAM 3 capability by themselves? I would say there's a lot of possibility there. My picture is that now we have a very good brain with these frontier models and we have a very good eye with SAM 3. Now let's see whether the eye really works natively together with the brain, or whether the eye is really a different organ that needs to work with the brain somehow like a tool. I think this is a very exciting research area.
>> And so in your analogy, if you think about the visual cortex compared to a human brain: we have rods and cones in our eyes that do very fast, we joked lizard-brain-level, detection of simple stuff, and then you have your brain that reasons about some of the visual information that your eyes see. In your example of SAM 3 as a tool call, or SAM 3 as natively a part of the multimodal models, which future do you think is more likely?
>> I think, at least, I want to bet on them working natively together. For simple, or even intermediate-difficulty, vision tasks, for example counting with fewer than 20 objects, I think this is like system-one visual reasoning with our brain. Our brain should do it by itself. But for very, very difficult tasks, say if we are counting maybe thousands of objects in a picture so crowded that we'd even need to draw something on it, at that time maybe we need some extra model for the difficult task. So this is a hybrid approach, but I'm more excited about, and I think most of the cases should be, native. The reason why is that I would say perception, or grounding, really knowing where it is and how many there are, is like a fundamental capability of our brain. I'm just not happy that the frontier model cannot count how many fingers immediately and instead needs to call a tool to do that. I think this should be a system-one thing, and it should be natively in our brain. And also, if our brain cannot do this task, that means it's definitely missing some very critical visual capability by itself. So I would say the intuition just feels that it's not correct to not have this capability by itself.
for very simple system one questions, things like how many fingers on a hand, that should be native. But for maybe more complex things that are maybe longunning tasks and longunning reasoning, then maybe there's a bit more
of like a tool call approach.
>> Yeah, yeah, exactly. For example, in our SAM 3 agents and in our AI annotator, we already demonstrate this approach. For simple cases the model can do it by itself: okay, I can detect, say, 10 people here. And then the AI annotator can even recognize that those 10 people are not exhaustive, that there are more people in the image, so if you want to do well you need to take more steps, for example calling an extra model. So this is a very native reasoning process for more advanced or complicated vision questions.
>> I have a related but maybe slightly different question.
SAM 3 is an incredibly powerful piece of work, and it's open source as part of, now, MSL. Is open source critical to achieving AGI?
>> Maybe I can comment on SAM specifically. In SAM 3 we did leverage many of the open-source contributions people made on top of SAM 2: there were new datasets, new benchmarks, new inference-time optimizations. We adopted a lot of the things the community built on top of the models and the datasets, and all of those contributions helped make SAM 3. For the SAM series, we really benefited a lot from being very generous with what we open source and then leveraging what the community builds on top of that. But that's just from the SAM perspective.
>> I think it's clear what the community brings and offers. Every time we do this, we always shout out to the community to try it on their use cases and report weird findings; if it doesn't do what you're trying to make it do, let's talk about it, and then maybe it gets implemented in the next version. You already hinted at what might be coming for SAM 4, at least a bit more of the document and OCR work. What other directions are interesting? Obviously a lot more video work as well. What's the talk of the town in the CV community, like next year is going to be the year of what?
>> Yeah, maybe I can talk first and then others can add. First, definitely, even if it's not SAM 4, there will be a SAM 3.something: small models. SAM 3 currently has only one model size, so we want a more efficient model that fits edge use cases, and also a more efficient model for video. Right now the video model is not efficient: you can't achieve very good throughput, and you need GPUs to run it. So small and efficient models, that's one big thing. The second big thing is definitely video.
>> Yeah. The second thing is video. I would say video still has a big gap to human performance; there's still a lot of research to be done there, like how to do end-to-end training with video. We have this decoupled detector-tracker approach, but we do not train the model end to end, and we expect it would definitely benefit from end-to-end training. Also on the video side, how to scale up the data engine: we definitely need AI annotators for video. We tried that, but I think that's something worthwhile to keep pushing on. The third thing, which we also discussed, is how SAM, how perception, fits into the big AGI landscape. Now we have the eye; how does the eye work with the brain to solve real reasoning tasks, not only output segmentation but really answer how many kids are here, or even harder questions? I have an example from biology labs: a robot needs to decide whether the liquid in a test tube is at the correct level or not. That involves perception, but it also involves reasoning. How to solve these more involved reasoning tasks with SAM is a very big direction.
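As a toy illustration of that perception-plus-reasoning split, here is a minimal sketch assuming a segmenter has already produced binary masks for the test tube and the liquid; the mask convention and the acceptable fill band are made-up assumptions, not anything from the SAM 3 release.

```python
# Hypothetical sketch: perception provides masks, a small reasoning step on top
# decides whether the fill level is acceptable.
import numpy as np

def fill_ratio(tube_mask: np.ndarray, liquid_mask: np.ndarray) -> float:
    """Fraction of the tube's vertical extent that the liquid reaches (2D boolean masks)."""
    tube_rows = np.where(tube_mask.any(axis=1))[0]
    liquid_rows = np.where(liquid_mask.any(axis=1))[0]
    if len(tube_rows) == 0 or len(liquid_rows) == 0:
        return 0.0
    tube_top, tube_bottom = tube_rows.min(), tube_rows.max()
    liquid_top = liquid_rows.min()  # image y grows downward, so the min row is the liquid surface
    return float(tube_bottom - liquid_top) / max(tube_bottom - tube_top, 1)

def level_ok(tube_mask: np.ndarray, liquid_mask: np.ndarray,
             lo: float = 0.4, hi: float = 0.6) -> bool:
    """The 'reasoning' step on top of perception: is the fill level inside the target band?"""
    return lo <= fill_ratio(tube_mask, liquid_mask) <= hi
```

The masks themselves would come from prompting a segmenter for concepts like "test tube" and "liquid"; the decision logic on top is where the reasoning layer, or the robot's policy, takes over.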
>> On the robotics topic, it was exciting to hear from several friends who work at different robotics companies about how they're immediately starting to use SAM 3. Especially for the video use case, I think robotics is probably one of the domains where improving video performance will have a lot of impact, so that's definitely an area we could improve further. But to Pengchuan's point, I think there's still another step change to be achieved on video PCS.
>> Yeah, just a quick comment on the robotics thing. We're interviewing a bunch of robotics folks here as well, like Fei-Fei, who obviously started ImageNet. A lot of people are betting on explicit world models, and SAM is not that, for better or worse, and I wonder when that crossover might happen. That's an open question, if you guys want to take on any world-models discussion about where things are going, based on community questions.
>> Similar to how Nikhila mentioned with SAM 1, the almost obvious thing people wanted was open concept prompting, because people were saying, great, this model can see things, but I want to tell it what I want it to see. Now, with the introduction of SAM 3, you have this stepwise component that feels like a key piece of the ChatGPT era for vision arriving. As a result, you've now provided people with an open text box plus media, so you're going to get all sorts of queries that maybe the model isn't primed to perform particularly well on yet. For example, earlier we were talking about document understanding and document reasoning being a place with known improvements to be made. So you'll have people who prompt to try to OCR things, or people who want to do spatial reasoning, like give me the object to the left of this other object, or give me a sense of where things are in relation to one another, which is critical for robotics, like we're discussing, because that's how you navigate the real world. You'll also have people who want action recognition and vision-language-action models, VLAs, the same kinds of tasks where people are used to providing open text prompts and getting back, here's the part of the scene where the player kicked the ball or the tennis player made the serve. Those are interesting for how to understand and synthesize visual inputs. So now that you've given this open text box for media, there's going to be a flood of things users want to try, some of which SAM is already really well adapted to, some of which not, and I think the obvious gaps will reveal themselves. One of the things we wanted to discuss is where to use SAM and how to build with SAM.
In addition to the Meta team building a tremendous playground for interacting with images and video and applying effects, with a video emphasis, one of the things we're pretty excited about with SAM 3 is how much it positively impacts each part of building a system for visual understanding. For example, the historical first step of aggregating and collecting a dataset, because you think no model understands the slice of the world you care about, is where a lot of labeling can be automated away. Basically, if you've collected a bunch of data of something that's already in SAM 3's knowledge, you can prompt SAM 3 to automatically label all of that data for you. So we've made a bet on SAM 3 being a core part of auto-label at Roboflow, giving users a first pass: if you have a new image or a new video, start with just a text prompt and let SAM 3 find and automatically label those regions of interest for you. Downstream of that, there are areas for fine-tuning: within a week of releasing SAM 3, MedSAM 3 came out, adapting SAM to medical contexts, and I think that's a harbinger of what's to come. There will be lots of domain-specific adaptations of SAM in places where maybe there's a specific ontology someone wants to capture, or where the model just doesn't have great awareness yet, and we're already beginning to see that with hundreds of fine-tunes that users are creating for various domains. And then the last area is, okay, I've got my model, now I want to use it. One of the things we're really proud of is being ready on launch day with infrastructure that can burst and scale essentially without limit as folks deploy models and make them readily available. Having an endpoint that serves a fine-tuned model, or the model as is, or even a model that might run on edge hardware as smaller models come out or distillation rises, is also an awesome place where we're seeing SAM 3 be impactful across each part of the computer vision lifecycle and pipeline.
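Here is a minimal sketch of the first-pass auto-label flow described above, under assumed interfaces: the `TextPromptedSegmenter` protocol, the `Annotation` fields, and the review threshold are hypothetical stand-ins rather than the actual SAM 3 or Roboflow APIs.

```python
# Hypothetical sketch: a text prompt plus a promptable segmenter produces draft
# annotations, and low-confidence ones are flagged for human review.
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Annotation:
    image_id: str
    label: str
    mask: object                  # RLE, polygon, or bitmap; format depends on your tooling
    score: float
    needs_review: bool = False    # low-confidence predictions get flagged for a human

class TextPromptedSegmenter(Protocol):
    def segment(self, image, prompt: str) -> List[Annotation]: ...

def auto_label(segmenter: TextPromptedSegmenter, images: dict, prompt: str,
               review_below: float = 0.7) -> List[Annotation]:
    """First pass: label every image for one concept, flagging uncertain predictions."""
    drafts: List[Annotation] = []
    for image_id, image in images.items():
        for ann in segmenter.segment(image, prompt):
            ann.image_id = image_id
            ann.needs_review = ann.score < review_below
            drafts.append(ann)
    return drafts
```

A human (or a reviewing model) then only touches the flagged drafts instead of drawing every mask from scratch.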
>> That's awesome. I think especially about the impact on speeding up annotation: we've seen that consistently on Roboflow, and I'm really curious to see how the introduction of SAM 3 speeds up that process even further. Just from playing around with it, it's so much faster than having to manually annotate every single object, so I'm really curious to see how that improves the experience.
>> One of the things we were pretty excited about is that we were able to build an entirely new product in the world of SAM 3. We called it Rapid, and the basic idea is that there's probably a model that already understands the objects in the world that you want to see. Here I'm screen sharing an example: these are vehicles going by next to our office in San Francisco, and you can see here's a Waymo and here are other vehicles. If I have just this 10-second clip and, say, the first thing I want to do is count cars and get a sense of each of the vehicles, what's really awesome is I can just text prompt and say I want "vehicle," and as I toggle through different frames in my video, SAM 3 already recognizes and understands those objects. Now, one thing I think is really interesting: there was a conversation earlier about how much you want to rely on a model versus on humans shaping the output of the model for what you care about. For example, let's pretend that in this scene the only cars we care about are the ones before the crosswalk, not the ones far in the distance. Some people would say, I actually only want the objects the model is most confident about, and they would move the confidence slider to get fewer objects. Others might say, I want every single potential object in the scene, which even picks up reflections of objects on the building. As computer vision approaches a world where models can increasingly understand and improve themselves, and we rely on human output and human preference over what the models produce, we're going to get these funny scenarios where what a human cares about isn't immediately deterministic. I think that's where tooling fills a big gap, but it's also going to be really interesting to see how users start to apply the models, and why you need this last-mile work to put the model in the context of the domain someone is trying to tackle.
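A tiny sketch of the confidence-slider behavior Joseph is demonstrating, with made-up detections and box coordinates; the `Detection` shape is an illustrative assumption, not a real API.

```python
# Minimal sketch: the same set of "vehicle" detections filtered at different
# confidence thresholds, which is what the slider in the UI is doing.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    score: float
    box: tuple  # (x0, y0, x1, y1)

def filter_by_confidence(dets: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections at or above the chosen confidence threshold."""
    return [d for d in dets if d.score >= threshold]

# Made-up detections for one frame of the clip:
dets = [
    Detection(0.95, (10, 40, 120, 110)),   # car before the crosswalk
    Detection(0.88, (140, 42, 250, 108)),  # another nearby car
    Detection(0.35, (300, 20, 330, 45)),   # distant car
    Detection(0.12, (400, 5, 430, 30)),    # reflection of a car on the building
]

print(len(filter_by_confidence(dets, 0.8)))   # high threshold: 2 confident vehicles
print(len(filter_by_confidence(dets, 0.1)))   # low threshold: 4, reflections included
```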
>> So let me, since you're here: this is one of those things where I'm not sure this concept, the concept of labeling concepts, can scale, only because I don't know if this slider between less and more is the way, if ultimately I need to tell you whether or not to include reflections, right? Because sometimes the reflections are exactly what I want, and most of the time they're not going to be what I want. I don't know if some RLHF thing is going to solve any of that, because you just need more prompting; just saying "vehicle" is not going to do it. I don't know, feel free to disagree.
>> You can imagine such a pipeline coming. For example, as Swyx said, maybe the reflection is exactly what I want; then you need some iterations with the interface or the model to finally get what you need. So you need to specify the concept more clearly through multiple iterations. Can a human not be involved in that iteration, with the models just doing it automatically? That's something I find quite interesting. You can imagine this workflow: I want reflections; with the default threshold the model produces an output; then another very strong perception model, say Gemini 3, is asked whether there are reflections here, and if it says yes, we can automatically move the threshold lower and check again and again whether the reflections are now included or not. So somehow this process could possibly be done completely with AI.
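A minimal sketch of that fully automated loop, assuming a promptable detector and a separate judge VLM wrapped as plain callables; both interfaces are hypothetical stand-ins, not real SAM 3 or Gemini APIs.

```python
# Hypothetical sketch: run the promptable detector at a threshold, ask a
# stronger judge model whether the desired concept (e.g. reflections) is
# covered, and lower the threshold until it is.
from typing import Callable, Dict, List

Detect = Callable[[object, str, float], List[Dict]]   # (image, prompt, threshold) -> detections
Judge = Callable[[object, List[Dict], str], bool]     # "do these detections include X?" -> yes/no

def refine_until_covered(image, prompt: str, question: str,
                         detect: Detect, judge: Judge,
                         threshold: float = 0.5, step: float = 0.1,
                         floor: float = 0.05) -> List[Dict]:
    """Lower the detection threshold until the judge model says the concept is covered."""
    while threshold >= floor:
        detections = detect(image, prompt, threshold)
        if judge(image, detections, question):   # e.g. "are the window reflections included?"
            return detections
        threshold -= step                        # include more candidates and re-check
    return detect(image, prompt, floor)          # give up at the floor threshold
```

The same loop can run in the other direction, raising the threshold, when the intent is "only the confident, real vehicles, no reflections."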
So for now the answer is image, and we can sort of tie it closer.
>> I think Joseph is showing us the Waymo annotation. Yeah, it's nice. Now you have a Waymo model.
>> Yeah, I was just doing an example where maybe we want to find an object that's not already represented in the training data. I think prompting could solve the problem of the reflections, because maybe you could say "vehicles on the street," but to your point, you'd have to see that that's a failure case first, right? If I were just setting up a camera and saying count cars, I wouldn't anticipate that reflections could be a problem. And I think this is why, in some ways, you need a human in the loop: identifying human intention, not necessarily human knowledge, is what's going to be important for a lot of last-mile use.
>> But yeah, I'm pretty excited about it.
>> Yeah, maybe I want to echo what Joseph said.
This is also my experience: different people have quite different definitions of even a single visual concept. For example, on some datasets, even for "hand," some people annotate just the palm as the hand, and some people include the arm as well. When we first tested SAM 3 on some very customized datasets, we found the performance was not that good, and when we finally looked into it, it turned out the user simply had a different definition or interpretation of the concept, and both interpretations are okay. In that case you really do need a human in the loop, to do few-shot fine-tuning or otherwise adapt the model to the user's definition of the concept.
>> That's exactly right. It's not always deterministic what someone really wants, which is why I think even if you have a fully comprehensive, omniscient model, putting the model into the context of what the user is trying to do is where a lot of tooling and infrastructure becomes really helpful. Anyway, I found our Waymos.
>> You continue to build excellent tooling for vision, and I think the world is very grateful for that. Let's get to the call to action.
We've sort of given it a good overview, and people should obviously read the paper, try out the playground, and try out Roboflow if they're interested in diving deeper. What is the call to action from each of you?
>> I mean, try the demo, try the code. We've got a lot of resources on the GitHub repo.
>> It's a very well managed launch, by the way, kudos. I don't know, this probably takes a lot of effort just on the launch itself, even after the model's done.
>> Yeah, and actually just on that, one thing: a shout out to the whole team. I think SAM 3 was our biggest and most ambitious project to date, and it really took a huge team of scientists, engineers, interns, and software engineers across the company. So really, a huge shout out to the entire team that made not just the model successful, but also the demo and the launch and everything. It was a huge team effort. And we'd definitely love to hear from people about what you're using the models for and where it's failing: raise GitHub issues, message us on Twitter. We'd love to hear from you on where we should go next as well.
>> Yeah. And on top of that, definitely try out our benchmark, the SA-Co benchmark. I would say it's likely the model won't saturate the benchmark; maybe next year there will be a stronger model. But the benchmark is the one I hope will guide the community toward better and better models. We measured human performance on the benchmark, and I think we may be the first to do that for this kind of segmentation and video grounding task; it's very difficult to measure human performance here. Hopefully this benchmark can guide the community to reach human performance on this task and even surpass it.
>> We set out to be one of the best places, if not the best place, to build with SAM 3 and the SAM family of models, so we're eager to see what people build with SAM and computer vision models to move the whole field forward. We have infrastructure for everything from deploying SAM 3 zero-shot, to making your own fine-tunes, to automating the labeling of data with SAM, and with each subsequent release we see the impact expand the number of use cases and the amount of use, and accelerate the time to value. So, excited to see what folks build on Roboflow with SAM.
>> Thank you all so much. This is really great coverage, great work, and as always it expands my mind as to what's possible with machine learning. I mean, we're not at ASI or AGI yet, but every day we're getting closer.
>> Awesome. Thank you so much.
>> Thank you. Thank you.