Building Feedback-Driven Annotation Pipelines for End-to-End ML Workflows
By Voxel51
## Summary
## Key takeaways

- **Tight Curate-Annotate-Train-Evaluate Loop**: The central focus is how to build an extremely tight curate-annotate-train-evaluate loop for visual AI models using FiftyOne and FiftyOne's in-app annotation capabilities. [00:53], [01:03]
- **Visual AI Needs Manual Ground Truth**: Visual AI is distinctive in that it relies on ground truth. It is not self-selective and self-labeled in the way that large language models or other text-based models are. [04:24], [04:34]
- **Core-Set Beats Full Labeling**: On the classic ImageNet dataset, labeling just 10% of the data on the initial iteration yielded 54% accuracy, outperforming methods that labeled 100% of the data. [17:06], [17:26]
- **Random Sampling Misses Rare Cases**: If a dataset is 90% daytime and 10% nighttime, random sampling naturally gives you nine daytime samples out of ten, but nighttime is where models fail. [16:05], [16:15]
- **Prioritize False Negatives in Iteration**: False negatives expose gaps in your training distribution and are difficult to identify in practice. Use embeddings to compare model false negatives against ground truth in your test set and find the cluster where the model is failing. [20:14], [20:27]
- **30/70 Coverage-Targeting Balance**: Balance coverage and targeting; one approach is about 30% for coverage and about 70% for targeting. Coverage ensures diversity via embedding-space analysis; targeting mines samples similar to failures. [21:20], [21:54]
## Topics Covered
- Curation Beats Annotation
- Core-Set Outperforms Full Labeling
- Prioritize False Negatives
- Balance 30% Coverage 70% Targeting
## Full Transcript
All right. Hello, everyone. Welcome, and so glad you could join today. If you are on this call, you are here for the Voxel51 webinar on annotation pipelines: really an end-to-end, feedback-focused annotation workshop covering best practices using the FiftyOne platform. Happy to have you here. My name is Nick Lots, and I work with the community team here at Voxel51. Over the next hour or so we are going to do a workshop on building, as the title suggests, feedback-driven annotation pipelines.

The central focus of this workshop is thinking about how we can build an extremely tight curate-annotate-train-evaluate loop for our visual AI models using FiftyOne and FiftyOne's in-app annotation capabilities. Whether you are brand new to the FiftyOne platform or have been using it for a while, this workshop will have something for you. We have just released in-app annotation capabilities on the platform, and we'll be taking a good look at them today in the context of the rest of the platform's capabilities.

Again, my name's Nick, and we'll be here together over the next hour. A couple of logistical items. Thanks so much to those who have attended. If and when you have questions, we will definitely have Q&A available at the end of the call. There is also a Q&A panel in your Zoom toolbar; sometimes you need to click the three dots labeled "More" when you hover over the toolbar, and then you can click Q&A. I'll be keeping an eye on that as questions come in. If they're relevant in real time, I'll do my best to get to them in real time; otherwise, there will be some Q&A time at the end as well.

From an assets and takeaway-resources perspective, there are a couple of things I will leave you with, and I'll introduce them early in this workshop. Much of the workshop is built around an end-to-end tutorial we wrote that's available in the FiftyOne documentation. The tutorial walks through using a multimodal dataset not only to curate and annotate data, but also to go through an iterative training and evaluation loop. We take a test-set strategy to make sure we're constantly and incrementally improving the data we're labeling, and obviously improving the models in the process.

We take both a qualitative and a quantitative approach to data selection. We'll use an embeddings-based flow to try to uncover data that is diverse, as well as data that is potentially causing models to fail, and we'll try to be as efficient as possible in our data-selection strategies. On the quantitative side, we'll look at what's called core-set selection, which is a way of examining the diversity of your data and choosing the most unique samples, the ones most worth labeling. And again, we'll keep that tight flow throughout the process. All of these resources will be available as follow-ups after the workshop: the slide deck, the tutorial, and the recording will all be sent in a follow-up email, so if you've registered, you'll get all of them.

Okay, so let's take a look at where we are going.
We're going to hop right into things and look at the problem space. Ultimately, the initial problem we'll be looking at is why annotation workflows break down in the first place. Visual AI is distinctive in the sense that it relies on ground truth. It is not self-selective and self-labeled, to use a less technical term, in the way that large language models or other text-based models are: we have to tell the model what the correct labels for our data are.

There are a couple of problems with that. One is that, historically, annotation is very labor intensive. Foundation models have helped, and vision-language models have helped, but at the same time it's expensive, and as datasets grow larger and larger, living in data lakes and so forth, we want to make sure we are choosing the best possible samples to label, and that we keep doing so continuously. Anyone who's built models before knows that the process is extremely iterative: you push the boundary slightly further with each training loop. Training a model one-shot, through a single linear path, is not normally going to get you the best performance; it's a method of continuous improvement.

So we're going to look at that loop and see what feedback-driven annotation looks like. Without burying the lede, it means asking, first, where is the most diversity in our unlabeled data, so we can focus labeling there; and second, which types of data are causing model failures. Of particular interest are false negatives, because those are famously difficult to identify using quantitative measures. There are some good qualitative measures, and that will be a big focus of what we do here.

We're also going to focus specifically on multimodal data. We'll take advantage of FiftyOne's grouped dataset model to see, when we have a collection of different camera angles along with 3D point clouds associated together in what we call group samples, how we can ensure we're building an effective loop. Lastly, we'll end on architecture patterns, and there will be a demo interlaced throughout, where I'll walk through some highlights of the tutorial I've linked, the main elements of that getting-started guide, which is really meant to be an end-to-end flow of curate, annotate, train, evaluate within the FiftyOne app.

If you want to get a sense of where we're going, or even want to get started, the tutorial is linked here at the bottom; it takes you to the annotation getting-started guide at docs.voxel51.com under the getting started guides. There is also a QR code at the end of the slides as a takeaway, and I'll share that then. There are two tracks to this tutorial. The quickstart track gives you the down-and-dirty of in-app annotation and is focused on installing FiftyOne, loading a multimodal dataset, and getting started with annotation; the fuller track goes through the whole curate-annotate-train-evaluate flow I talked about earlier. And then we'll end with some Q&A as well.
So let's talk about the annotation problem. We are biased here at Voxel51 toward data-centric AI, where our premise, which we'd argue has largely been borne out, is that tweaking model architecture these days has less of an impact than ensuring we're training our models on good data. At the same time, a purely spray-and-pray approach to the historically very manual and labor-intensive effort of data annotation doesn't necessarily improve your model. Part of the issue is data selection. Take the classic example of self-driving data: if our model fails at night, but two-thirds of our dataset is daytime, and therefore two-thirds of our labeling effort goes to daytime samples, that's not what's causing our model to fail in the first place. What matters is selecting the data to label, and labeling it correctly, insofar as it improves model performance without overfitting.

What often goes wrong is that annotation lives outside the machine learning workflow. It often relies on outside vendors, although foundation models and internal tooling are progressively improving that. There often isn't feedback from evaluation to relabeling: a communication loop between your model evaluation metrics (precision, recall, F1, mean average precision) and how they actually map back to improving your data selection and annotation. Models also need specific data to fix specific errors, and that often requires expertise that an approach as simple as "label as much as possible" can, at best, only incidentally provide. Ultimately, when a model fails for a specific reason, the best way to fix it is to fix the specific data behind that failure mode.

There are a couple of ways to approach this. On the front end, data curation becomes increasingly important. When we have access to massive amounts of data, labeling 100% of it simply isn't possible. So a key engineering problem on the front end is curating that data into a subset that gives us a really good initial shot at a high-performing model, and then iterating on it. Once we have that subset (more on those selection techniques in a little bit), we can look at how to go back and improve upon it. How can we choose additional samples to label based on, one, the current diversity of our dataset with respect to what we would want in real-world conditions, and two, the current failure modes of the model? Where is the model producing false positives? And, in my view more importantly, where is the model producing false negatives? Where are we finding class confusion between certain objects, or whatever type of labels we're applying to our dataset? So: curation, and then iteration.

As I mentioned before, so much of this comes down to a garbage-in, garbage-out problem. When you don't have a good strategy for data curation, the result is often misleading metrics and poor generalization in production. When your model is not trained on a dataset that actually moves performance, the result is poorly performing models in real life. The result is missed edge cases, because the model was never appropriately trained on them in the first place. And then there's silent degradation, and this part is critical. Ground truth is our first principles: what is the correct identification of the cars, the people, or whatever objects we want to identify in our dataset. Incorrect labels silently degrade model performance, because we operate on the assumption that the ground truth is correct. We therefore need strategies to go back and check whether the ground truth actually has problems, such that we need to iterate on our labels. That's largely where a really tight loop of in-app annotation comes in: you don't have to export to an external service, so you can iterate more quickly, in real time, within the same platform in which you're curating, training, and evaluating. So again, not to put too fine a point on it, the shift is curation before annotation.

The traditional route is that you label everything, perform a whole bunch of annotation, then train the model, and QA is something of an afterthought, in the sense that it's difficult to reverse the loop and go back into annotation when you don't actually know what in the data is causing your model to fail. There's a little bit of hoping for the best, and that's what the bottom-left graphic shows: you have a pool of unlabeled data, you label everything, you train the model, and then maybe more of your effort goes to the model-architecture side, because that's what you have control over.

What we want to take instead is a feedback-driven approach. We can use curation techniques like embeddings, and then what we'll call ZCore, or zero-shot core-set selection, to find the highest-value samples worth annotating, and train the model on those samples. The idea is that we use very objective criteria on the front end to determine which data is likely to produce the most generalizable model; then we evaluate, find where the model is failing, and repeat the loop. If we can identify the parts of the embedding space where the model is not performing well, we can redo the flow, identify additional samples, either qualitatively, by examining the embedding space visually, or through methods like the core-set selection we'll talk about, and then choose additional high-value labels for re-annotation. The idea is a very metrics-driven approach to continuous model improvement.
So the new way, rather than just grabbing random samples, is to compute embeddings first, so we have both a qualitative and a quantitative understanding of our data; visualize the distribution; and then select what to label. Part of using FiftyOne, which we'll look at in a little bit, is that before we've even applied a label, we take our unlabeled pool of data, compute visual embeddings, and try to find things like outliers, or representative segments of the embedding space that we know we want captured in our initial dataset. A technique I'll show is that by applying core-set selection on top of the embeddings, we can color the embedding space by what we call the ZCore score, which is essentially a uniqueness value that lets us see which parts of the embedding space are most worth our attention when generating an initial set of samples to label.

A question just came into chat: do we apply statistical approaches for selecting high-value data? Yes, that will be the case; it's algorithmic. I'll show you in a little bit how we can apply not just the embeddings computation and visualization but also core-set selection to actually color this space. It's not a heat map, but there's a color gradient that points out the unique, high-diversity values; we can select those and then curate a subset from there.
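As a rough illustration of the kind of uniqueness signal being described (the actual ZCore scoring is more sophisticated; this is only a minimal stand-in), one can score each sample by its distance to its nearest neighbor in embedding space, so near-duplicates score low and samples in sparse regions score high:

```python
import math

def uniqueness(embeddings):
    """Score each embedding by its distance to its nearest neighbor:
    near-duplicates score low, sparse-region samples score high."""
    scores = []
    for i, a in enumerate(embeddings):
        nearest = min(
            math.dist(a, b) for j, b in enumerate(embeddings) if j != i
        )
        scores.append(nearest)
    return scores

# Three near-duplicate points and one outlier
embs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = uniqueness(embs)
best = max(range(len(embs)), key=lambda i: scores[i])
# the outlier at (5, 5) gets the highest uniqueness score
```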
And to that point, this is borne out in research. Our machine learning team here at Voxel51 has found that smart selection almost always, for many tasks, beats random sampling. As an example, say you have a dataset that's 90% daytime and 10% nighttime. Random sampling would naturally give you nine out of ten daytime samples. But if nighttime is where the model is failing, random sampling is not the most efficient approach, because random sampling reflects frequency, not diversity, so the rare but critical cases go underrepresented. It might not even be just nighttime samples; it might be samples of pedestrians walking alone at night. We can use core-set selection to add unique coverage: redundant data samples score low, whereas sparser regions score high, so we get a more representative dataset. As one result (you can read the full paper for details; it's linked in the annotation blog post I'll share in a little bit), using the classic ImageNet dataset, labeling just 10% of the data on the initial iteration of a model resulted in 54% accuracy, outperforming methods that labeled 100% of the data. There we found that downstream models perform as well as, if not better than, full data labeling in certain cases, through this core-set selection practice.
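For intuition, here is a minimal sketch of one classic core-set strategy, greedy farthest-point (k-center) selection. It is not the ZCore algorithm itself, but it shows how diversity-driven selection surfaces rare samples like the nighttime case that random sampling would almost always miss:

```python
import math

def greedy_coreset(points, k):
    """Greedy k-center selection: repeatedly take the point farthest
    from everything already selected, maximizing coverage/diversity."""
    selected = [0]  # seed with the first point
    while len(selected) < k:
        def d_to_selected(i):
            # distance from point i to its nearest already-selected point
            return min(math.dist(points[i], points[j]) for j in selected)
        farthest = max(range(len(points)), key=d_to_selected)
        selected.append(farthest)
    return selected

# 9 "daytime" points clustered near the origin, 1 "nighttime" outlier
daytime = [(0.1 * i, 0.0) for i in range(9)]
nighttime = [(10.0, 10.0)]
points = daytime + nighttime
picked = greedy_coreset(points, k=3)
# the rare nighttime sample (index 9) is selected despite being 10% of data
```

A 10% random sample of this pool would include the nighttime point only about 30% of the time across three draws; the greedy selection finds it on the very first expansion step.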
So what does that mean in practice? What would it actually look like in our workflow? During annotation, the idea is to have focused batches: label, say, 50 curated samples, not 10,000 random ones. These are round numbers, and this will be an iterative process where we continually add additional samples that are worth labeling. In addition, enforce your schema heavily. For example, when you are using an annotation platform like the one we'll see in FiftyOne in a little while, ensure that your annotators can only use the schemas that you allow. Make sure you're rejecting a class like "vehicle" if someone attempts to hard-code it when your schema only supports "truck" for the bounding boxes you can draw around an object.
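A minimal sketch of what hard schema enforcement might look like; the schema contents and reject handling here are illustrative, not FiftyOne's actual validation API:

```python
# Minimal sketch of hard schema enforcement at annotation time.
# The schema and the reject handling are illustrative placeholders.

SCHEMA = {
    "detections": {"truck", "car", "pedestrian"},  # allowed classes
}

def validate_label(field, label):
    """Reject any label whose class is outside the allowed schema."""
    allowed = SCHEMA.get(field, set())
    if label not in allowed:
        raise ValueError(
            f"'{label}' is not in the '{field}' schema {sorted(allowed)}"
        )
    return label

validate_label("detections", "truck")        # accepted
try:
    validate_label("detections", "vehicle")  # rejected: not in schema
    rejected = False
except ValueError:
    rejected = True
```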
In addition, you need to make sure your samples are aligned, and this is something FiftyOne natively supports through what are called grouped datasets. When you think about the equipment on a self-driving car, you have multiple camera angles, and you might also have LiDAR, which gives you 3D cuboids in a point-cloud sample. You need to make sure those are aligned together within a single sample, so that everything from time-series data to model recognition stays aligned within a single view.
Then, when it comes to training, the way our tutorial approaches things is that you might start off with a very small sample set. Again, 20 is just an example I use here; it would likely be larger, because with algorithmic methods it's pretty quick to identify the samples that need labels. The idea is that we have a certain subset of samples with expert-verified labels. There's a, hopefully automated, schema check, so we're enforcing the schema but also performing quality assurance on it. And in addition, before training, there's a coverage check to sanity-check things, making sure that if samples have zero labels on them at all, you catch it before it corrupts the training pipeline.
And that's part one. Remember, I said there are two key parts to this data-centric approach. Part one is curating a diverse dataset, where underrepresented regions, even though they're underrepresented in your total data pool, become a critical part of the subset. The second is using your model as your debugging tool in an iterative approach: train the model on your initial selection, and then listen to what it tells you. False positives will obviously reveal label noise or ambiguous cases; those are fairly easy to identify. What is more intriguing is false negatives, because those expose gaps in your training distribution. They're also the unknown unknowns of the process, and they are difficult to identify in practice. This is where embeddings can help. When you look at the model's false negatives relative to the ground truth in your test set, you can look at the embeddings distribution and see where the cluster in which the model is actually failing lies. And the last piece is, obviously, your confusion matrices: where does your schema need refinement because the model is confusing one object for another? Again, that's a native part of the FiftyOne platform, as we'll see in just a minute. So this is going to be one of several iterative loops; it becomes a feedback signal for your next labeling batch, where you'll choose your next batch based on a combination of what is still underrepresented in the dataset and where the model is failing.
So what's a good balance here? There is no universal answer. It is well established in the literature that you need to balance coverage and targeting; one approach is about 30% for coverage and about 70% for targeting. Naturally, this will be a little iterative, because you'll adapt it based on how your model is performing and where you're seeing influence on one side or the other. When I say coverage, I'm talking about diversity, and this is where the analysis of the embedding space and core-set selection come in; this ensures that we have a balanced distribution of data and helps prevent model overfitting. The other roughly two-thirds is targeting, where we actually mine samples similar to your failures so we can improve the model on the specific errors it's making. Maybe the model can't identify stop signs in the snow; the idea is, okay, we need examples of that to specifically show the model what a stop sign looks like when it's covered with snow.

So again, in round numbers, with a budget of 100 labels per iteration, we might see 30 samples covering new regions of the embedding space and 70 samples from failure mining, where we're finding clusters of false positives or, more intriguingly, false negatives.

Closing the loop, we're looking for an approach that goes something like this. From a cold start, our initial selection, we don't have a model yet, so we can't do targeting at the beginning; we have to focus entirely on coverage. We rely purely on embeddings and other algorithmic techniques to ensure we have a diverse subset of data, so we can get a baseline model with broad coverage. Then, as we continue to iterate, we still focus on coverage, but we increasingly focus on model targeting as we add additional samples to our dataset. We're, sometimes asymptotically, approaching that 30/70 split I talked about, where 30% is based on coverage and 70% on targeting failure modes in our model, so we can rapidly improve based on those failure modes. In later iterations, we look at when we're starting to plateau, when we're starting to reach diminishing returns. From there, we have additional approaches: we might focus on model architecture, or on the total pool of data we've been sampling from and continue to improve it, or we might start to bring in different modalities. This exact progression can only last so long on its own, but we've seen good results with core-set selection.
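The progression from a pure-coverage cold start toward the 30/70 split can be sketched as a simple budget schedule; the geometric decay rate here is an arbitrary illustrative choice, not a recommended constant:

```python
def budget_split(iteration, total=100, floor_coverage=0.3, decay=0.5):
    """Label-budget schedule: 100% coverage on the cold start,
    decaying geometrically toward a 30/70 coverage/targeting split."""
    coverage_frac = floor_coverage + (1.0 - floor_coverage) * decay**iteration
    coverage = round(total * coverage_frac)
    return coverage, total - coverage

splits = [budget_split(i) for i in range(5)]
# iteration 0 -> (100, 0): no model yet, so pure coverage;
# later iterations approach the (30, 70) split asymptotically
```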
All right, I'm going to drop into a screen-share demo here in just a moment, but before I do, a couple of things to keep in mind, because so far this has been very conceptual; these points are about doing it in practice. First of all, maintaining a frozen test set is critical. There are facilities in FiftyOne for keeping different views and clones of datasets, and for ensuring that one set isn't leaking into the other. That's obviously critical, because otherwise you can't trust your model. Tracking what's labeled, QA'd, and ready for training can be done through tagging and through saved, shareable views. Measuring efficiency can be useful, that is, the number of labels per change in mean average precision, though this should be taken with a grain of salt, especially if you're using model-assisted approaches to labeling, because labeling becomes a little less of a tedious, expensive exercise as you increasingly use foundation models; still, it can be useful, especially at the beginning. It's also useful to be able to annotate without leaving the curation environment. FiftyOne, particularly the enterprise version, includes quite a lot of capabilities for dataset versioning, access control, and exportable datasets within the app. Having a flow where you don't have to leave the app, where you can track your dataset versions and the progress on them, from both an annotation perspective and a training-and-evaluation perspective, within a single "commit" of your dataset, so to speak, is ideal: you don't have to swivel back and forth between the state of your annotation platform and the state of your training and evaluation loop. There are quite a few primitives we'll show here in just a moment.
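One simple way to keep a test set frozen across iterations, sketched outside of any particular tool, is a deterministic hash-based split: a sample's assignment never changes no matter how many curation rounds run, so newly curated batches cannot leak into the test set:

```python
import hashlib

def split_of(sample_id, test_frac=0.2):
    """Deterministically assign a sample to 'test' or 'train' by hashing
    its id, so the test set stays frozen across every iteration."""
    h = int(hashlib.sha256(sample_id.encode()).hexdigest(), 16)
    return "test" if (h % 100) < test_frac * 100 else "train"

ids = [f"sample-{i}" for i in range(1000)]
test_ids = {i for i in ids if split_of(i) == "test"}

# newly curated batches are always drawn outside the frozen test set
new_batch = [i for i in ids if i not in test_ids][:100]
leakage = test_ids.intersection(new_batch)  # should always be empty
```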
So what does this look like? This next demo is adapted heavily from the tutorial I linked earlier, which can be found at docs.voxel51.com under the getting started guides; it's the annotation guide. For those who want to follow along or ultimately go through the tutorial, the code at the beginning of the tutorial gets you started quickly. Everything I'm going to show in this flow is available in the open-source version of FiftyOne. It's an installable Python library, literally just pip install fiftyone. It includes a local web server to serve the UI, a built-in MongoDB database that is hosted locally as well, and a dataset zoo of easily importable datasets to get started; there's literally a method you can run, load_zoo_dataset, and you can find all the zoo datasets we support, natively importable, in the docs. The one we're going to use to get started is the somewhat cryptically named quickstart-groups dataset. It's actually a subset of the KITTI multimodal self-driving dataset. It's not huge, about 200 samples, but it's useful for demonstrating left camera views, right camera views, and 3D point clouds. And again, I'll swap over here in just a moment.
So when you pip install FiftyOne and load the dataset, what does it actually look like? Give me a moment here.
It looks something like this. The FiftyOne app has launched, and here is the server. Let me just pretend here for a moment; I'll switch to another iteration of this dataset shortly. Here is what it looks like in the UI. You can switch between the left and right camera views, which is what you're seeing in the sample grid, and you can view point clouds in the sample grid as well. But if I click into a particular sample and open it up, here is the full distribution.
A couple of things to note. In this grouped dataset modality we see, whoops, sorry, give me one second here. Got it. There we go.
We see the full point cloud view, and we see cuboids aligned with bounding boxes on both the left and right camera views. These are all mapped through a sample ID, and mapped in this time series data as well. So this is the explore pane. What you'll also notice, as of last Thursday when 1.13 of FiftyOne was released, is this Annotate tab. This is where the annotation loop comes in: I have my schema, which I can modify either through this graphical interface or, if you've used the FiftyOne SDK before, through this JSON schema that you can update as well. And then within here I can choose to add a new bounding box. Maybe I want to tag a car; let me do that here.
There's my car, and now that becomes literally just a new label added to the dataset. Right now we support classification and detection labels for both 2D and 3D. So, oops, sorry, let me go back here for a moment: let me change the annotation slice to PCD and find a car here.
On a similar front, if I find my schema and add a cuboid, let's see where we're at here. It's an imperfect example with this initial drawing, but you can see I've labeled this car, and you can just as easily drag and edit existing boxes as well.
So in-app annotation is now fully built into FiftyOne. Where this becomes most powerful is when it becomes a QA method within the larger loop I discussed earlier. What I'm going to do now is go to a copy of this dataset (actually, no, it's right here) that I've set up into different views so we can work our way through the flow I discussed earlier.
Where we're going to start is with a raw unlabeled pool. I'm using a round number here: let's say my pool of data is 130 samples. Now, where did I get this from in the first place? A couple of useful things FiftyOne has. Even starting here, this pool might be drawn from an even larger superset of data, like a data lake. FiftyOne includes some pretty powerful integration with your data warehouse or data lake: Databricks, BigQuery, Snowflake. For example, in this dataset we have a feature called Data Lens where, before I even have this initial pool of samples, I can open the Data Lens panel and, if I have things like vector embeddings configured in my data lake, search for samples that I want to add to my initial superset pool, which I will then subset from. Maybe I want pedestrians walking at night: if I have natural language search here, I can query my broader data lake in natural language for those samples.
This will take just a moment, and I'm not going to go through the whole thing here, but you see the idea. Now, these happen to be labeled samples, but even if this were an unlabeled pool, the search isn't using those labels to find them; it's using natural language search, and I could import them in.
But let's say I have this pool. Step two is: how do I select from there? The most powerful way is to compute embeddings. Just to save a little time, if you go to the tutorial (one moment here, sorry, I was grabbing the right step), this is step three of the tutorial. You can run some code to compute embeddings on different slices of data; this is a native method in FiftyOne. You compute the embeddings themselves and then compute the visualization based off those embeddings, which gives you a really good visual of your embedding space. Then, optionally, you can also run zero-shot core-set selection, again linked in step three of the tutorial, which basically gives a uniqueness score to each sample in the dataset.
What you end up with looks like this. Here is my raw embedding space: again, just 130 samples, but we can extrapolate from here. Also useful: if I color by ZCore score, the more into the green and yellow we are, the more unique the data and the more interesting it would be to add to our initial training subset. For example, we can see that around here is probably worth taking a snag at, and maybe a little bit up here. Again, 130 samples, so not a huge embedding space; with a larger dataset we could potentially see other areas worth selecting. And from there, that can be your initial subset. So we might pretend that we select from here, and then some additional areas, and save that into a view we're calling our initial subset, say batch v0. And this batch v0, we can see, is a small example with 37 samples: a small subset of the total 130 samples, grabbed from this area of the embedding space.
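A minimal sketch of what that selection step is doing, using greedy farthest-point selection over a toy 2-D embedding space. This is a stand-in for illustration only; the ZCore algorithm referenced in the talk is a different, published method.

```python
import math

def coreset_select(embeddings, k):
    """Greedy farthest-point selection: repeatedly pick the unselected
    sample whose distance to its nearest already-selected sample is
    largest. A toy stand-in for zero-shot core-set selection."""
    selected = [0]  # seed with an arbitrary first sample
    while len(selected) < k:
        best, best_d = None, -1.0
        for i, e in enumerate(embeddings):
            if i in selected:
                continue
            # distance from this candidate to its nearest selected point
            d = min(math.dist(e, embeddings[j]) for j in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return selected

# 130 unlabeled samples -> pick a 37-sample "batch v0", as in the demo
points = [(math.cos(i), math.sin(i * 0.7)) for i in range(130)]
batch_v0 = coreset_select(points, 37)
print(len(batch_v0))  # 37
```

In FiftyOne the embeddings would come from a real model via compute_embeddings; the greedy rule here just captures the intuition of preferring samples far from what you have already selected.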
And again, the idea is that this is unlabeled data, so it's potentially not a huge lift for our annotators to go through. FiftyOne will always support exporting to open-source annotation tools like CVAT and others, and we have very deep integrations with third-party annotation tools. But now the platform itself includes (whoops, if I open up a sample) the ability to annotate these directly as well: define a schema and have it all done in-app. I could easily do that here.
All right.
So this becomes our initial batch, and it's from here that we can train our first model. For the sake of time, I'm not going to go through the whole initial model training loop here, but note that this is also well supported in FiftyOne. Let me go to the next step in the tutorial, actually to step five.
Step four here is annotation QA; again, you can create cuboids and 3D annotations in the app as well.
What you can then do is train your first model. The tutorial example has us fine-tuning a YOLOv8 model, and from there (let me go back, oh sorry, let me go back to here) we have a view that we might call our evaluation set. It has 30 samples, and we can see there are now predictions on it. This is our first pass. It's kind of hard to see, since these are in gray here, but these are predictions overlaid with the ground truth.
Good question here in the chat: does ZCore work the same way as FiftyOne's compute_uniqueness function? It's not the exact same algorithm, but it is another uniqueness-related method. I'd need to direct you to the actual paper for the deep details on how it works; the blog I'll share as a resource in a bit links to a post focused on ZCore. It's not exactly the same as the k-nearest-neighbor approach compute_uniqueness uses (there are some differences), but it is related. compute_uniqueness is still very powerful as well: it lets you sort your data by uniqueness and find the most unique samples for later labeling. Zero-shot core-set selection is slightly different but related.
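To make the distinction concrete, here is a toy k-nearest-neighbor uniqueness score in plain Python. It mirrors the spirit of compute_uniqueness (which, per the talk, is k-NN based), not its exact implementation, and ZCore is a different algorithm entirely.

```python
import math

def uniqueness_scores(embeddings, k=3):
    """Score each sample by the mean distance to its k nearest
    neighbors in embedding space; higher means more unique."""
    scores = []
    for i, e in enumerate(embeddings):
        dists = sorted(
            math.dist(e, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# a tight cluster plus one outlier: the outlier scores highest
emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = uniqueness_scores(emb, k=2)
print(scores.index(max(scores)))  # 3 -> the outlier
```

Sorting a dataset by a score like this surfaces the rare samples worth labeling first, which is exactly how compute_uniqueness is used in the curation flow.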
Okay. So now I have this validation set with predictions overlaid.
Let me go to explore here. Sorry, let me go back for a moment. Here's the left view. For some reason this is showing the PCD; give me one second. That's why. And this is still showing the PCD; I don't know why. Give me one last refresh here.
Ah, type error, failed to fetch. What's going on? This site can't be reached. Great. One second. I think my localhost connection just died. I can flip over to another one and show you another example while we're waiting.
In fact, while that's happening and I'm relaunching the FiftyOne app (I think there are some gremlins in the demo system), a quick interlude. Remember I mentioned that human annotation is obviously critical; we are definitely going to need to rely on experts to annotate data. At the same time, foundation models are increasingly supported as well. FiftyOne includes auto-labeling via foundation models as a native function in FiftyOne Enterprise. The reason it's in Enterprise and not natively in open source is that it requires what are called delegated operations, which are ways of delegating work to attached GPUs. The way it works in practice in your flow is: you can obviously open up a sample and annotate it manually, but on top of that you can go to this auto-labeling panel and choose a target set of samples, what type of labeling you want a foundation model to perform, the classes you want to label in your dataset, and the minimum confidence threshold, and then within the app you can apply your labels there as well. So when I say we're performing an annotation loop, I really mean human-in-the-loop annotation with potentially varying levels of automation.
Okay, let me see if I can reconnect to my system here.
All right. I think there was another question: when do you intend to support segmentation masks within the built-in annotation? Segmentation is supported within auto-labeling. In terms of the in-app label editor, it's a roadmap item, something we eventually intend to support. Right now, brand new, it's classification and object detection, but segmentation is definitely on the roadmap; I don't have a firm date at this time, but it's something we're actively developing and prioritizing.
All right. So, looking at this validation set, we can see we've applied predictions and ground truth.
From here, we now want to get our initial set of model metrics; this is our first iteration. It depends on which version of FiftyOne you're using: in the Enterprise version, there's a built-in in-app method for performing model evaluation; in open-source FiftyOne, there's an SDK method, just like everything else shown here, where you literally run evaluate_detections (or a similar method, depending on whether you're doing classifications or detections) and it computes critical model metrics with respect to your ground truth. Now again, in our case this was a sample set of 130 samples of a particular slice, and I fine-tuned a YOLO model on commodity hardware, so I would not expect the model to be super high performing in this first iteration. And if I look at my initial metrics, that's correct: it is not super high performing whatsoever. At the same time, we can start to get an initial sense of where the model is performing and where there is class confusion. I would particularly suggest that false negatives and class confusion are where you'd want to start, and this is where we can begin to approach our 30/70 split from here on out. Looking at the embedding space, maybe we can find a lot of instances with false negatives or false positives on certain things. So I might go into the embedding space and ask: where are nighttime or pedestrian samples clustered, and are there false negatives there? So, I might go back here.
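The false-negative mining idea can be sketched with a minimal IoU matcher. FiftyOne's evaluate_detections computes full metrics; this toy only counts ground-truth boxes that no prediction matched, which are the cases worth hunting in the embedding space.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def count_false_negatives(ground_truth, predictions, thresh=0.5):
    """Ground-truth boxes with no prediction above the IoU threshold
    are false negatives: gaps in the training distribution."""
    return sum(
        1 for gt in ground_truth
        if not any(iou(gt, p) >= thresh for p in predictions)
    )

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10)]  # overlaps the first box only
print(count_false_negatives(gt, preds))  # 1
```

The samples contributing false negatives are the ones whose neighborhoods in embedding space you then mine for the next batch.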
Rather than coloring by core-set selection, let me actually go back to my pool. I might color by, oh, you know what, I would have needed to add the ground truth label to my embeddings computation, and I didn't do that here. You actually do this in the tutorial; I just didn't do it here, I apologize. But it's the same idea: it would highlight by class, and then you might select clusters and choose additional samples from there. And then it just becomes an iterative loop, right? You have your pool, you might augment that pool further from a data lake using a tool like Data Lens, and then you go through the loop: annotate a few key samples, train a model based off those annotated samples, run model evaluation metrics, find where your model is performing poorly in terms of coarse statistics or class confusion, and then identify the types of samples on which the model isn't performing well. Highlight those in the embedding space, tag them for your next batch, and go from there.
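Assembling the next batch with roughly 30% coverage and 70% targeting can be sketched like this. The sample IDs and the failure-similarity list are hypothetical placeholders; in practice, the "failure-like" list would come from similarity search around false-negative clusters in embedding space.

```python
import random

def next_batch(pool, failure_like, batch_size, coverage_frac=0.3):
    """Build the next annotation batch: ~30% coverage (diverse samples
    from the whole pool) and ~70% targeting (samples similar to
    known model failures)."""
    n_cov = round(batch_size * coverage_frac)
    n_tgt = batch_size - n_cov
    targeting = failure_like[:n_tgt]  # nearest-to-failure first
    remaining = [s for s in pool if s not in targeting]
    coverage = random.sample(remaining, n_cov)
    return coverage + targeting

pool = [f"sample_{i}" for i in range(100)]
failures = [f"sample_{i}" for i in range(40, 80)]  # mined near failure clusters
batch = next_batch(pool, failures, batch_size=20)
print(len(batch))  # 20
```

Pure random sampling would mostly return common daytime-style samples; splitting the budget this way keeps diversity while spending most labels where the model fails.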
So where does this ultimately take us? A couple of places. Let me quickly go back to my slide deck here.
First, we need to take note of a couple of failure modes. One thing I took for granted is that our training and test sets are properly segregated: you need a frozen test set, otherwise you can't trust any of the work you're doing. Second, be very careful about label drift over time. Gatekeep your schema heavily; make sure labelers aren't adding new labels that weren't originally addressed by the model. Make sure you include a golden set of data that you know is 100% correct in its ground truth. Don't chase only edge cases; in some cases that can result in the model overfitting toward those edge cases. That divided approach, roughly a third toward coverage and two-thirds toward model failure modes, is important, and when you have that tight loop you can easily stay close to whatever ratio you choose.
Then, from a QA bottleneck standpoint: anyone who writes software or trains models knows that eventually you need to ship, often fast, and you always need to iterate. Creating focused views in FiftyOne with the in-app annotation fixes we now support is the way to go here, because very often exporting to a third-party annotation tool requires an entirely different pipeline, whereas here we can keep everything compact within the platform.
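The golden-set check mentioned above can be sketched minimally: compare each iteration's annotations against a trusted golden set and watch the agreement rate across iterations. The field names and labels here are illustrative, not a FiftyOne schema.

```python
def golden_set_agreement(golden, annotations):
    """Fraction of golden-set samples whose annotator label matches
    the trusted ground truth. A declining rate across iterations is
    a signal of label drift."""
    matches = sum(
        1 for sample_id, label in golden.items()
        if annotations.get(sample_id) == label
    )
    return matches / len(golden)

golden = {"s1": "car", "s2": "truck", "s3": "pedestrian"}
batch_labels = {"s1": "car", "s2": "vehicle", "s3": "pedestrian"}
print(round(golden_set_agreement(golden, batch_labels), 3))  # 0.667
```

Because the golden set is frozen and known-correct, any drop in this number points at drifting guidelines or rogue labels rather than at the data itself.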
So, before I move forward, another question I see in chat: what types of multimodal data can Voxel51 work with, and how do you best annotate them? For example, 3D molecule data, descriptions, pictures. From a general data support perspective, FiftyOne supports 2D images, LiDAR data, and 3D point cloud data of any type, and we support annotation of both 2D and 3D data: classification, detection, and cuboids are our initial pass at annotation support, and the goal is to very quickly increase it from there. From a general data support standpoint (curation, embeddings, training models), it's virtually any type of 2D and 3D data. We also support video, as well as video frames naturally since we support 2D images, plus audio and audio spectrograms, and we're steadily increasing our coverage there. Right now, what's supported in in-app annotation is a subset of the total data modalities FiftyOne supports, but our plan is very much for that subset of annotation-supported modalities to reach the total superset of modalities we support across all parts of the app.
All right.
So, what are a couple of key takeaways? I mentioned this already, and then I'll leave us with a couple of lessons learned and where to go from here. Like I said before, everything I've discussed is available within the open-source app, with the exception of the auto-labeling panel I briefly opened; technically the Data Lens panel is Enterprise-only as well. But curation, annotation, model training, and model evaluation are all supported natively. Really, if there's one thing that's most critical of everything I discussed, aside from the in-app annotation, it's the embeddings engine. Using embeddings is critical for having a quantitative, and then qualitative, visualization of the embedding space, where you get a bird's-eye view of your data distribution, and then using zero-shot core-set selection to color the embedding space by high-value samples. It's worth noting that ZCore is really just a mathematical algorithm; this GitHub repo in the Voxel51 org is a useful way to get started, and the tutorial actually has you run core-set selection on the data within FiftyOne. Right now it's not just a button you press; it's a method, an algorithm you actually run with the SDK. There is work on getting native plug-in support for it, though, so it can be easily run through the app UI as well.
Then, add targeting. As you train and evaluate models, evaluate failure modes based on core model metrics and class confusion, and update your next batch with additional samples to label, which you can again label natively in the app with manual annotation or, if you're a FiftyOne Enterprise customer, with auto-labeling. Approach that 30/70 balance, and QA every iteration: maintain that golden dataset of samples you know are 100% correct to catch any type of data drift, and enforce a common schema so annotators can't go rogue with the types of labels they're producing. If you noticed what I was showing within the in-app annotation, you saw that you can enforce radio buttons and drop-down selection to make it easier on your annotators.
All right. So where to go from here? There are a couple of useful resources that will be helpful. Number one, the conceptual flow of this entire talk was based off of this tutorial; the link is right here, and for convenience I also included a QR code if you're interested.
And if you want more general conceptual details about the way we've incorporated annotation into the platform, whether you're new to FiftyOne or an existing user: this is a new feature in FiftyOne. FiftyOne at its core was a data curation engine for most of its existence, and it still is; that's still our bread and butter. We posit that the best way to get better models is to start with really good data curation, and annotation is meant to provide an engine for that: to make the flow tighter and more iterative, and to make it easier to add, edit, and apply labels to your dataset. This blog post goes into more detail as well, if you'd like to learn more.
That being said, I want to leave the last couple of minutes here for any additional questions folks may have. I saw, I think, Joseph, you had your hand raised; you're free to ask a question if you still have one, either here or in the Q&A, whatever you feel most comfortable with. Happy to field any other questions as well.
Actually, I don't know if folks might be in listen-only mode on the call, so if you're trying to talk and you're muted, you might need to use the Q&A or chat. I'm also looking back through the chat as well.
Yes. Okay, I see some note takers; I think I answered most of these.
Awesome. Another really good resource in general to learn more: if you go to voxel51.com/annotation, that's a landing page to get more context on how we've included annotation in the platform. For just getting started, your best friend is going to be docs.voxel51.com. That's your best friend for installing FiftyOne and getting started with both the open-source and enterprise versions of the software. There's a whole host of getting started guides, industry-specific guides, and tutorials, and they're all in self-contained notebooks. So if you want to download and run them yourself, look at the GitHub source; this is all meant to be very contained. They almost always start with installing FiftyOne from scratch if you haven't yet, so it's meant to be easy for you to get going.
Yeah, so a couple of good questions about embeddings. You'll notice when you look at the FiftyOne methods that the model you use for computing embeddings is configurable. So here is, hold on, compute_visualization. Let's see, torchvision. For some reason I might be missing where the method is here, but literally the method is called compute_embeddings, and then you compute the visualization based off those embeddings. You can pass in whatever model you choose; it is not fixed in FiftyOne. FiftyOne is extremely open and integrable in that respect, so you are not tied into any particular architecture.
Next question: how to annotate multimodal data consisting of 3D molecular data and tissue pictures. The 3D annotation, I think, should for the most part support that. It's going to depend on exactly what your modality looks like. Again, we support classification and detection for both 2D and 3D data, including 3D point clouds. When you're annotating, say, molecular data, it would depend on your exact workflow: there could be some classification involved, and 3D cuboids are supported as well. If you look at the dataset zoo, there might be some healthcare-related datasets you can look through, and we also have full Hugging Face integration. So, Joseph, my recommendation is to parse through the zoo, look for some healthcare-related datasets (they're very easy to load if they're on Hugging Face), and see if any map to your needs and what they would look like within the annotation feature.
Okay: can I use my own embedding model for semantic search, and where will the model be hosted for creating a feature vector from the search term? There are a couple of approaches. When you use FiftyOne for any type of searchability, computing uniqueness, or computing what we call mistakenness on the dataset, for any of these embedding-based methods the flow is: you compute embeddings, you literally pass in the model, which in local FiftyOne would be hosted locally and would need to run on local GPUs, and then you pass the result into the FiftyOne method for what we call similarity search, or uniqueness, or any of the others. That's an interesting question; I might actually follow up with you on it so I can give a better answer, because I might be slightly misinterpreting what you're asking. But yes, when you're using local FiftyOne, the model is almost always hosted locally and runs on local GPUs, and you can use any arbitrary model for computing embeddings.
All right, next question: I'm interested in taking a snapshot for taking the top three relations based on a codebase. I'm not fully understanding your question. Ratatin, sorry if I mispronounced your name: FiftyOne fully supports dataset versioning as an Enterprise feature. I'm not exactly sure if that's what you're referring to, like snapshotting a dataset. In FiftyOne Enterprise there is native in-app support for creating basically a git-style version of the dataset. You can naturally use other open-source tools like DVC for dataset versioning, or, as a somewhat hacky approach in open-source FiftyOne, you can create separate dataset views and then export them using an SDK method to create separate datasets. I'm not exactly sure if that's what you're asking, so I apologize, but there are multiple methods by which you can version and snapshot existing views or even entire clones of FiftyOne datasets.
All right.
So, generally, yeah: annotation drift, to the last question here, is a broad term that really refers to changes in labeling guidelines, which really means changes in schema, that cause labels to become less reliable over time because the criteria for labeling in the first place have changed. That can take a couple of forms. Number one, the label names themselves might change or drift over time. Maybe it was previously just a generic term like "vehicle" and over time became "car" and "truck", or what counts as a car or a truck changed as the annotation guidelines changed. Now you have a mixed dataset of vehicle, car, and truck, and the model progressively becomes confused: the language is being changed, and therefore the model's understanding will change. Another way annotation drift can happen is that the criteria for how you do the labeling change. Maybe the tightness of a bounding box changes over time, going from a tight fit to a looser one with more wiggle room. Or in edge cases, like blurry images or dusk and dawn photos, what's considered "night" versus "day" might slightly shift as different personnel do the labeling. That's really what we mean by annotation drift: the model's objective understanding of what is being labeled changes over time because what is considered objective ground truth has slightly changed as well. Now, in some cases that can actually make the model more robust, if it's able to build a better understanding from multiple edge cases labeled in different ways; it can actually become smarter in that case. But on the other side, it can also become confused if there are very clear cases where two labels on very similar objects are shown to be two different things based on changes in annotation guidelines.
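Gatekeeping the schema, the first defense against this kind of drift, can be as simple as flagging any label outside the agreed set. A toy sketch, with illustrative label names:

```python
def flag_schema_drift(labels, allowed):
    """Return any labels that fall outside the agreed schema, e.g.
    annotators introducing 'car'/'truck' into a dataset whose schema
    only defines 'vehicle'. Catching these early prevents drift."""
    allowed = set(allowed)
    return sorted({label for label in labels if label not in allowed})

seen = ["vehicle", "vehicle", "car", "truck", "vehicle"]
print(flag_schema_drift(seen, ["vehicle", "pedestrian"]))  # ['car', 'truck']
```

The in-app annotation editor's radio buttons and drop-downs enforce this at entry time; a check like this is the batch-level equivalent for labels arriving from elsewhere.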
Yeah, Orin, great question: how would you approach using embeddings when the scene is very busy? Obviously it can cluster day and night, but what if the differences you're looking for are in the objects and not the scene? Very good question. Something we did not directly address in this talk is the idea of patches. Let's see if we actually have it set up here. Do I have a patch view here? I don't quite have one, but there's a method you can run in FiftyOne called to_patches. Notice that each of these samples has multiple bounding boxes in it. What the patches view does is create one patch per object. For example, a lot of these scenes might have a dozen different labels on them; patch views let you create a sample for every label. Then, when you compute embeddings based off of that, the embeddings become more granular, on a per-object basis. So when you're looking at a particular embedding, you're saying: okay, that is the sample that specifically has a car in it. It has other things as well, but there are different samples for those other things because of the patch view we created. So when you're looking at very busy scenes like that, you can create multiple patch views and visualize the embedding space on them. And yes, you're correct: a good starting point is to use a foundation model to create some initial bounding boxes, then create patch views of those boxes. That maps one sample per object in the sample grid, and if you then run embeddings on that particular view after the patch conversion, it illustrates that space as well. Great question.
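The idea behind to_patches() can be sketched as flattening image-level samples into one record per labeled object. The dict layout here is illustrative, not FiftyOne's actual document schema:

```python
def to_patches(samples):
    """Flatten image-level samples into one record per labeled object,
    mirroring the idea behind a patches view: per-object records let
    you compute per-object embeddings in busy scenes."""
    patches = []
    for sample in samples:
        for det in sample["detections"]:
            patches.append({
                "sample_id": sample["id"],
                "label": det["label"],
                "bbox": det["bbox"],
            })
    return patches

scene = {
    "id": "img_001",
    "detections": [
        {"label": "car", "bbox": (0.1, 0.2, 0.3, 0.2)},
        {"label": "pedestrian", "bbox": (0.6, 0.5, 0.1, 0.3)},
    ],
}
print(len(to_patches([scene])))  # 2
```

Running embeddings over records like these, rather than over whole images, is what makes object-level clusters separable in a busy scene.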
All right.
Any other questions?
All right. Well, we are at time. Thank you so much for your attendance and participation.
We will follow up with some of the resources that I mentioned at both the beginning and the end. I highly recommend pip install fiftyone; give human annotation a spin, and you can learn more about it in the resources I mentioned. Otherwise, enjoy the rest of your Wednesday. Thanks so much, everyone.