Building Feedback-Driven Annotation Pipelines for End-to-End ML Workflows
By Voxel51
## Summary
## Key takeaways

- **Tight Curate-Annotate-Train-Evaluate Loop**: The central focus is how to build an extremely tight curate-annotate-train-evaluate loop for visual AI models using FiftyOne and FiftyOne's in-app annotation capabilities. [00:53], [01:03]
- **Visual AI Needs Manual Ground Truth**: Visual AI is distinctive in that it relies on ground truth. It is not self-selective and self-labeled in the way that large language models or other text-based models are. [04:24], [04:34]
- **Core-Set Beats Full Labeling**: On the classic ImageNet dataset, labeling just 10% of the data on the initial iteration yielded 54% accuracy, outperforming methods that labeled 100% of the data. [17:06], [17:26]
- **Random Sampling Misses Rare Cases**: If a dataset is 90% daytime and 10% nighttime, random sampling naturally gives you nine daytime samples out of ten, but nighttime is where models fail. [16:05], [16:15]
- **Prioritize False Negatives in Iteration**: False negatives expose gaps in your training distribution and are difficult to identify in practice. Use embeddings to compare model false negatives against ground truth in your test set and find the cluster where the model is failing. [20:14], [20:27]
- **30/70 Coverage-Targeting Balance**: Balance coverage and targeting; one approach is about 30% for coverage and about 70% for targeting. Coverage ensures diversity via embedding-space analysis; targeting mines samples similar to failures. [21:20], [21:54]
## Topics Covered
- Curation Beats Annotation
- Core-Set Outperforms Full Labeling
- Prioritize False Negatives
- Balance 30% Coverage 70% Targeting
## Full Transcript
All right. Hello, everyone. Welcome, and so glad you could join today. If you are on this call, you are here for the Voxel51 webinar on annotation pipelines: really an end-to-end, feedback-focused annotation workshop covering best practices using the FiftyOne platform. Happy to have you here. My name is Nick Lots, and I work with the community team here at Voxel51. Over the next hour or so we are going to do a workshop on building, as the title suggests, feedback-driven annotation pipelines.

The central focus of this workshop is thinking about how we can build an extremely tight curate-annotate-train-evaluate loop for our visual AI models using FiftyOne and FiftyOne's in-app annotation capabilities. Whether you are brand new to the FiftyOne platform or have been using it for a while, this workshop will have something for you. We have just released in-app annotation capabilities on the platform, and we'll be taking a good look at them today in the context of the rest of the platform's capabilities.

Again, my name's Nick, and we'll be here together over the next hour. A couple of logistical items. Thanks so much to those who have attended. If and when you have questions, we will definitely have Q&A available at the end of the call. There is also a Q&A panel in your Zoom toolbar; sometimes you need to click the three dots labeled "More" when you hover over the toolbar, and then you can click Q&A. I'll be keeping an eye on that as questions come in. If they're relevant in real time, I'll do my best to get to them in real time; otherwise, there will be some Q&A time at the end as well.

From an assets and takeaway-resources perspective, there are a couple of things I will leave you with, and I'll introduce them early in this workshop. Much of the workshop is built around an end-to-end tutorial we wrote that's available in the FiftyOne documentation. The tutorial walks through using a multimodal dataset not only to curate and annotate data, but also to go through an iterative training and evaluation loop. We take a test-set strategy to make sure we're constantly and incrementally improving the data we're labeling, and obviously improving the models in the process.

We take both a qualitative and a quantitative approach to data selection. We'll use an embeddings-based flow to try to uncover data that is diverse, as well as data that is potentially causing models to fail, and we'll try to be as efficient as possible in our data-selection strategies. On the quantitative side, we'll look at what's called core-set selection, which is a way of examining the diversity of your data and choosing the most unique samples, the ones most worth labeling. And again, we'll keep that tight flow throughout the process. All of these resources will be available as follow-ups after the workshop: the slide deck, the tutorial, and the recording will all be sent in a follow-up email, so if you've registered, you'll get all of them.

Okay, so let's take a look at where we are going.
We're going to hop right into things and look at the problem space. Ultimately, the initial problem we'll be looking at is why annotation workflows break down in the first place. Visual AI is distinctive in the sense that it relies on ground truth. It is not self-selective and self-labeled, to use a less technical term, in the way that large language models or other text-based models are: we have to tell the model what the correct labels for our data are.

There are a couple of problems with that. One is that, historically, annotation is very labor intensive. Foundation models have helped, and vision-language models have helped, but at the same time it's expensive, and as datasets grow larger and larger, living in data lakes and so forth, we want to make sure we are choosing the best possible samples to label, and that we keep doing so continuously. Anyone who's built models before knows that the process is extremely iterative: you push the boundary slightly further with each training loop. Training a model one-shot, through a single linear path, is not normally going to get you the best performance; it's a method of continuous improvement.

So we're going to look at that loop and see what feedback-driven annotation looks like. Without burying the lede, it means asking, first, where is the most diversity in our unlabeled data, so we can focus labeling there; and second, which types of data are causing model failures. Of particular interest are false negatives, because those are famously difficult to identify using quantitative measures. There are some good qualitative measures, and that will be a big focus of what we do here.

We're also going to focus specifically on multimodal data. We'll take advantage of FiftyOne's grouped dataset model to see, when we have a collection of different camera angles along with 3D point clouds associated together in what we call group samples, how we can ensure we're building an effective loop. Lastly, we'll end on architecture patterns, and there will be a demo interlaced throughout, where I'll walk through some highlights of the tutorial I've linked, the main elements of that getting-started guide, which is really meant to be an end-to-end flow of curate, annotate, train, evaluate within the FiftyOne app.

If you want to get a sense of where we're going, or even want to get started, the tutorial is linked here at the bottom; it takes you to the annotation getting-started guide at docs.voxel51.com under the getting started guides. There is also a QR code at the end of the slides as a takeaway, and I'll share that then. There are two tracks to this tutorial. The quickstart track gives you the down-and-dirty of in-app annotation and is focused on installing FiftyOne, loading a multimodal dataset, and getting started with annotation; the fuller track goes through the whole curate-annotate-train-evaluate flow I talked about earlier. And then we'll end with some Q&A as well.
So let's talk about the annotation problem. We are biased here at Voxel51 toward data-centric AI, where our premise, which we'd argue has largely been borne out, is that tweaking model architecture these days has less of an impact than ensuring we're training our models on good data. At the same time, a purely spray-and-pray approach to the historically very manual and labor-intensive effort of data annotation doesn't necessarily improve your model. Part of the issue is data selection. Take the classic example of self-driving data: if our model fails at night, but two-thirds of our dataset is daytime, and therefore two-thirds of our labeling effort goes to daytime samples, that's not what's causing our model to fail in the first place. What matters is selecting the data to label, and labeling it correctly, insofar as it improves model performance without overfitting.

What often goes wrong is that annotation lives outside the machine learning workflow. It often relies on outside vendors, although foundation models and internal tooling are progressively improving that. There often isn't feedback from evaluation to relabeling: a communication loop between your model evaluation metrics (precision, recall, F1, mean average precision) and how they actually map back to improving your data selection and annotation. Models also need specific data to fix specific errors, and that often requires expertise that an approach as simple as "label as much as possible" can, at best, only incidentally provide. Ultimately, when a model fails for a specific reason, the best way to fix it is to fix the specific data behind that failure mode.

There are a couple of ways to approach this. On the front end, data curation becomes increasingly important. When we have access to massive amounts of data, labeling 100% of it simply isn't possible. So a key engineering problem on the front end is curating that data into a subset that gives us a really good initial shot at a high-performing model, and then iterating on it. Once we have that subset (more on those selection techniques in a little bit), we can look at how to go back and improve upon it. How can we choose additional samples to label based on, one, the current diversity of our dataset with respect to what we would want in real-world conditions, and two, the current failure modes of the model? Where is the model producing false positives? And, in my view more importantly, where is the model producing false negatives? Where are we finding class confusion between certain objects, or whatever type of labels we're applying to our dataset? So: curation, and then iteration.

As I mentioned before, so much of this comes down to a garbage-in, garbage-out problem. When you don't have a good strategy for data curation, the result is often misleading metrics and poor generalization in production. When your model is not trained on a dataset that actually moves performance, the result is poorly performing models in real life. The result is missed edge cases, because the model was never appropriately trained on them in the first place. And then there's silent degradation, and this part is critical. Ground truth is our first principles: what is the correct identification of the cars, the people, or whatever objects we want to identify in our dataset. Incorrect labels silently degrade model performance, because we operate on the assumption that the ground truth is correct. We therefore need strategies to go back and check whether the ground truth actually has problems, such that we need to iterate on our labels. That's largely where a really tight loop of in-app annotation comes in: you don't have to export to an external service, so you can iterate more quickly, in real time, within the same platform in which you're curating, training, and evaluating. So again, not to put too fine a point on it, the shift is curation before annotation.

The traditional route is that you label everything, perform a whole bunch of annotation, then train the model, and QA is something of an afterthought, in the sense that it's difficult to reverse the loop and go back into annotation when you don't actually know what in the data is causing your model to fail. There's a little bit of hoping for the best, and that's what the bottom-left graphic shows: you have a pool of unlabeled data, you label everything, you train the model, and then maybe more of your effort goes to the model-architecture side, because that's what you have control over.

What we want to take instead is a feedback-driven approach. We can use curation techniques like embeddings, and then what we'll call ZCore, or zero-shot core-set selection, to find the highest-value samples worth annotating, and train the model on those samples. The idea is that we use very objective criteria on the front end to determine which data is likely to produce the most generalizable model; then we evaluate, find where the model is failing, and repeat the loop. If we can identify the parts of the embedding space where the model is not performing well, we can redo the flow, identify additional samples, either qualitatively, by examining the embedding space visually, or through methods like the core-set selection we'll talk about, and then choose additional high-value labels for re-annotation. The idea is a very metrics-driven approach to continuous model improvement.
So the new way, rather than just grabbing random samples, is to compute embeddings first, so we have both a qualitative and a quantitative understanding of our data; visualize the distribution; and then select what to label. Part of using FiftyOne, which we'll look at in a little bit, is that before we've even applied a label, we take our unlabeled pool of data, compute visual embeddings, and try to find things like outliers, or representative segments of the embedding space that we know we want captured in our initial dataset. A technique I'll show is that by applying core-set selection on top of the embeddings, we can color the embedding space by what we call the ZCore score, which is essentially a uniqueness value that lets us see which parts of the embedding space are most worth our attention when generating an initial set of samples to label.

A question just came into chat: do we apply statistical approaches for selecting high-value data? Yes, that will be the case; it's algorithmic. I'll show you in a little bit how we can apply not just the embeddings computation and visualization but also core-set selection to actually color this space. It's not a heat map, but there's a color gradient that points out the unique, high-diversity values; we can select those and then curate a subset from there.
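As a rough illustration of the kind of uniqueness signal being described (the actual ZCore scoring is more sophisticated; this is only a minimal stand-in), one can score each sample by its distance to its nearest neighbor in embedding space, so near-duplicates score low and samples in sparse regions score high:

```python
import math

def uniqueness(embeddings):
    """Score each embedding by its distance to its nearest neighbor:
    near-duplicates score low, sparse-region samples score high."""
    scores = []
    for i, a in enumerate(embeddings):
        nearest = min(
            math.dist(a, b) for j, b in enumerate(embeddings) if j != i
        )
        scores.append(nearest)
    return scores

# Three near-duplicate points and one outlier
embs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = uniqueness(embs)
best = max(range(len(embs)), key=lambda i: scores[i])
# the outlier at (5, 5) gets the highest uniqueness score
```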
And to that point, this is borne out in research. Our machine learning team here at Voxel51 has found that smart selection almost always, for many tasks, beats random sampling. As an example, say you have a dataset that's 90% daytime and 10% nighttime. Random sampling would naturally give you nine out of ten daytime samples. But if nighttime is where the model is failing, random sampling is not the most efficient approach, because random sampling reflects frequency, not diversity, so the rare but critical cases go underrepresented. It might not even be just nighttime samples; it might be samples of pedestrians walking alone at night. We can use core-set selection to add unique coverage: redundant data samples score low, whereas sparser regions score high, so we get a more representative dataset. As one result (you can read the full paper for details; it's linked in the annotation blog post I'll share in a little bit), using the classic ImageNet dataset, labeling just 10% of the data on the initial iteration of a model resulted in 54% accuracy, outperforming methods that labeled 100% of the data. There we found that downstream models perform as well as, if not better than, full data labeling in certain cases, through this core-set selection practice.
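For intuition, here is a minimal sketch of one classic core-set strategy, greedy farthest-point (k-center) selection. It is not the ZCore algorithm itself, but it shows how diversity-driven selection surfaces rare samples like the nighttime case that random sampling would almost always miss:

```python
import math

def greedy_coreset(points, k):
    """Greedy k-center selection: repeatedly take the point farthest
    from everything already selected, maximizing coverage/diversity."""
    selected = [0]  # seed with the first point
    while len(selected) < k:
        def d_to_selected(i):
            # distance from point i to its nearest already-selected point
            return min(math.dist(points[i], points[j]) for j in selected)
        farthest = max(range(len(points)), key=d_to_selected)
        selected.append(farthest)
    return selected

# 9 "daytime" points clustered near the origin, 1 "nighttime" outlier
daytime = [(0.1 * i, 0.0) for i in range(9)]
nighttime = [(10.0, 10.0)]
points = daytime + nighttime
picked = greedy_coreset(points, k=3)
# the rare nighttime sample (index 9) is selected despite being 10% of data
```

A 10% random sample of this pool would include the nighttime point only about 30% of the time across three draws; the greedy selection finds it on the very first expansion step.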
So what does that mean in practice? What would it actually look like in our workflow? During annotation, the idea is to have focused batches: label, say, 50 curated samples, not 10,000 random ones. These are round numbers, and this will be an iterative process where we continually add additional samples that are worth labeling. In addition, enforce your schema heavily. For example, when you are using an annotation platform like the one we'll see in FiftyOne in a little while, ensure that your annotators can only use the schemas that you allow. Make sure you're rejecting a class like "vehicle" if someone attempts to hard-code it when your schema only supports "truck" for the bounding boxes you can draw around an object.
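A minimal sketch of what hard schema enforcement might look like; the schema contents and reject handling here are illustrative, not FiftyOne's actual validation API:

```python
# Minimal sketch of hard schema enforcement at annotation time.
# The schema and the reject handling are illustrative placeholders.

SCHEMA = {
    "detections": {"truck", "car", "pedestrian"},  # allowed classes
}

def validate_label(field, label):
    """Reject any label whose class is outside the allowed schema."""
    allowed = SCHEMA.get(field, set())
    if label not in allowed:
        raise ValueError(
            f"'{label}' is not in the '{field}' schema {sorted(allowed)}"
        )
    return label

validate_label("detections", "truck")        # accepted
try:
    validate_label("detections", "vehicle")  # rejected: not in schema
    rejected = False
except ValueError:
    rejected = True
```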
In addition, you need to make sure your samples are aligned, and this is something FiftyOne natively supports through what are called grouped datasets. When you think about the equipment on a self-driving car, you have multiple camera angles, and you might also have LiDAR, which gives you 3D cuboids in a point-cloud sample. You need to make sure those are aligned together within a single sample, so that everything from time-series data to model recognition stays aligned within a single view.
Then, when it comes to training, the way our tutorial approaches things is that you might start off with a very small sample set. Again, 20 is just an example I use here; it would likely be larger, because with algorithmic methods it's pretty quick to identify the samples that need labels. The idea is that we have a certain subset of samples with expert-verified labels. There's a, hopefully automated, schema check, so we're enforcing the schema but also performing quality assurance on it. And in addition, before training, there's a coverage check to sanity-check things, making sure that if samples have zero labels on them at all, you catch it before it corrupts the training pipeline.
And that's part one. Remember, I said there are two key parts to this data-centric approach. Part one is curating a diverse dataset, where underrepresented regions, even though they're underrepresented in your total data pool, become a critical part of the subset. The second is using your model as your debugging tool in an iterative approach: train the model on your initial selection, and then listen to what it tells you. False positives will obviously reveal label noise or ambiguous cases; those are fairly easy to identify. What is more intriguing is false negatives, because those expose gaps in your training distribution. They're also the unknown unknowns of the process, and they are difficult to identify in practice. This is where embeddings can help. When you look at the model's false negatives relative to the ground truth in your test set, you can look at the embeddings distribution and see where the cluster in which the model is actually failing lies. And the last piece is, obviously, your confusion matrices: where does your schema need refinement because the model is confusing one object for another? Again, that's a native part of the FiftyOne platform, as we'll see in just a minute. So this is going to be one of several iterative loops; it becomes a feedback signal for your next labeling batch, where you'll choose your next batch based on a combination of what is still underrepresented in the dataset and where the model is failing.
So what's a good balance here? There is no universal answer. It is well established in the literature that you need to balance coverage and targeting; one approach is about 30% for coverage and about 70% for targeting. Naturally, this will be a little iterative, because you'll adapt it based on how your model is performing and where you're seeing influence on one side or the other. When I say coverage, I'm talking about diversity, and this is where the analysis of the embedding space and core-set selection come in; this ensures that we have a balanced distribution of data and helps prevent model overfitting. The other roughly two-thirds is targeting, where we actually mine samples similar to your failures so we can improve the model on the specific errors it's making. Maybe the model can't identify stop signs in the snow; the idea is, okay, we need examples of that to specifically show the model what a stop sign looks like when it's covered with snow.

So again, in round numbers, with a budget of 100 labels per iteration, we might see 30 samples covering new regions of the embedding space and 70 samples from failure mining, where we're finding clusters of false positives or, more intriguingly, false negatives.

Closing the loop, we're looking for an approach that goes something like this. From a cold start, our initial selection, we don't have a model yet, so we can't do targeting at the beginning; we have to focus entirely on coverage. We rely purely on embeddings and other algorithmic techniques to ensure we have a diverse subset of data, so we can get a baseline model with broad coverage. Then, as we continue to iterate, we still focus on coverage, but we increasingly focus on model targeting as we add additional samples to our dataset. We're, sometimes asymptotically, approaching that 30/70 split I talked about, where 30% is based on coverage and 70% on targeting failure modes in our model, so we can rapidly improve based on those failure modes. In later iterations, we look at when we're starting to plateau, when we're starting to reach diminishing returns. From there, we have additional approaches: we might focus on model architecture, or on the total pool of data we've been sampling from and continue to improve it, or we might start to bring in different modalities. This exact progression can only last so long on its own, but we've seen good results with core-set selection.
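The progression from a pure-coverage cold start toward the 30/70 split can be sketched as a simple budget schedule; the geometric decay rate here is an arbitrary illustrative choice, not a recommended constant:

```python
def budget_split(iteration, total=100, floor_coverage=0.3, decay=0.5):
    """Label-budget schedule: 100% coverage on the cold start,
    decaying geometrically toward a 30/70 coverage/targeting split."""
    coverage_frac = floor_coverage + (1.0 - floor_coverage) * decay**iteration
    coverage = round(total * coverage_frac)
    return coverage, total - coverage

splits = [budget_split(i) for i in range(5)]
# iteration 0 -> (100, 0): no model yet, so pure coverage;
# later iterations approach the (30, 70) split asymptotically
```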
All right, I'm going to drop into a screen-share demo here in just a moment, but before I do, a couple of things to keep in mind, because so far this has been very conceptual; these points are about doing it in practice. First of all, maintaining a frozen test set is critical. There are facilities in FiftyOne for keeping different views and clones of datasets, and for ensuring that one set isn't leaking into the other. That's obviously critical, because otherwise you can't trust your model. Tracking what's labeled, QA'd, and ready for training can be done through tagging and through saved, shareable views. Measuring efficiency can be useful, that is, the number of labels per change in mean average precision, though this should be taken with a grain of salt, especially if you're using model-assisted approaches to labeling, because labeling becomes a little less of a tedious, expensive exercise as you increasingly use foundation models; still, it can be useful, especially at the beginning. It's also useful to be able to annotate without leaving the curation environment. FiftyOne, particularly the enterprise version, includes quite a lot of capabilities for dataset versioning, access control, and exportable datasets within the app. Having a flow where you don't have to leave the app, where you can track your dataset versions and the progress on them, from both an annotation perspective and a training-and-evaluation perspective, within a single "commit" of your dataset, so to speak, is ideal: you don't have to swivel back and forth between the state of your annotation platform and the state of your training and evaluation loop. There are quite a few primitives we'll show here in just a moment.
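One simple way to keep a test set frozen across iterations, sketched outside of any particular tool, is a deterministic hash-based split: a sample's assignment never changes no matter how many curation rounds run, so newly curated batches cannot leak into the test set:

```python
import hashlib

def split_of(sample_id, test_frac=0.2):
    """Deterministically assign a sample to 'test' or 'train' by hashing
    its id, so the test set stays frozen across every iteration."""
    h = int(hashlib.sha256(sample_id.encode()).hexdigest(), 16)
    return "test" if (h % 100) < test_frac * 100 else "train"

ids = [f"sample-{i}" for i in range(1000)]
test_ids = {i for i in ids if split_of(i) == "test"}

# newly curated batches are always drawn outside the frozen test set
new_batch = [i for i in ids if i not in test_ids][:100]
leakage = test_ids.intersection(new_batch)  # should always be empty
```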
So what does this look like? This next demo is adapted heavily from the tutorial I linked earlier, which can be found at docs.voxel51.com under the getting started guides; it's the annotation guide. For those who want to follow along or ultimately go through the tutorial, the code at the beginning of the tutorial gets you started quickly. Everything I'm going to show in this flow is available in the open-source version of FiftyOne. It's an installable Python library, literally just pip install fiftyone. It includes a local web server to serve the UI, a built-in MongoDB database that is hosted locally as well, and a dataset zoo of easily importable datasets to get started; there's literally a method you can run, load_zoo_dataset, and you can find all the zoo datasets we support, natively importable, in the docs. The one we're going to use to get started is the somewhat cryptically named quickstart-groups dataset. It's actually a subset of the KITTI multimodal self-driving dataset. It's not huge, about 200 samples, but it's useful for demonstrating left camera views, right camera views, and 3D point clouds. And again, I'll swap over here in just a moment.
So when you pip install FiftyOne and load the dataset, what does it actually look like? Give me a moment here.
It looks something like this. The FiftyOne app has launched, and here is the server. Let me just pretend here for a moment; I'll switch to another iteration of this dataset shortly. Here is what it looks like in the UI. You can switch between the left and right camera views, which is what you're seeing in the sample grid, and you can view point clouds in the sample grid as well. But if I click into a particular sample and open it up, here is the full distribution.
A couple of things to note. In this grouped dataset modality we see, whoops, sorry, give me one second here. Got it. There we go.
We see the full point cloud view, and we see cuboids aligned with bounding boxes on both the left and right camera views. These are all mapped through a sample ID, and mapped in this time series data as well. So this is the explore pane. What you'll also notice, as of last Thursday when 1.13 of FiftyOne was released, is this Annotate tab. This is where the annotation loop comes in: I have my schema, which I can modify either through this graphical interface or, if you've used the FiftyOne SDK before, through this JSON schema that you can update as well. And then within here I can choose to add a new bounding box. Maybe I want to tag a car; let me do that here.
There's my car, and now that becomes literally just a new label added to the dataset. Right now we support classification and detection labels for both 2D and 3D. So, oops, sorry, let me go back here for a moment: let me change the annotation slice to PCD and find a car here.
On a similar front, if I find my schema and add a cuboid, let's see where we're at here. It's an imperfect example with this initial drawing, but you can see I've labeled this car, and you can just as easily drag and edit existing boxes as well.
So in-app annotation is now fully built into FiftyOne. Where this becomes most powerful is when it becomes a QA method within the larger loop I discussed earlier. What I'm going to do now is go to a copy of this dataset (actually, no, it's right here) that I've set up into different views so we can work our way through the flow I discussed earlier.
Where we're going to start is with a raw unlabeled pool. I'm using a round number here: let's say my pool of data is 130 samples. Now, where did I get this from in the first place? A couple of useful things FiftyOne has. Even starting here, this pool might be drawn from an even larger superset of data, like a data lake. FiftyOne includes some pretty powerful integration with your data warehouse or data lake: Databricks, BigQuery, Snowflake. For example, in this dataset we have a feature called Data Lens where, before I even have this initial pool of samples, I can open the Data Lens panel and, if I have things like vector embeddings configured in my data lake, search for samples that I want to add to my initial superset pool, which I will then subset from. Maybe I want pedestrians walking at night: if I have natural language search here, I can query my broader data lake in natural language for those samples.
This will take just a moment, and I'm not going to go through the whole thing here, but you see the idea. Now, these happen to be labeled samples, but even if this were an unlabeled pool, the search isn't using those labels to find them; it's using natural language search, and I could import them in.
But let's say I have this pool. Step two is: how do I select from there? The most powerful way is to compute embeddings. Just to save a little time, if you go to the tutorial (one moment here, sorry, I was grabbing the right step), this is step three of the tutorial. You can run some code to compute embeddings on different slices of data; this is a native method in FiftyOne. You compute the embeddings themselves and then compute the visualization based off those embeddings, which gives you a really good visual of your embedding space. Then, optionally, you can also run zero-shot core-set selection, again linked in step three of the tutorial, which basically gives a uniqueness score to each sample in the dataset.
What you end up with looks like this. Here is my raw embedding space: again, just 130 samples, but we can extrapolate from here. Also useful: if I color by ZCore score, the more into the green and yellow we are, the more unique the data and the more interesting it would be to add to our initial training subset. For example, we can see that around here is probably worth taking a snag at, and maybe a little bit up here. Again, 130 samples, so not a huge embedding space; with a larger dataset we could potentially see other areas worth selecting. And from there, that can be your initial subset. So we might pretend that we select from here, and then some additional areas, and save that into a view we're calling our initial subset, say batch v0. And this batch v0, we can see, is a small example with 37 samples: a small subset of the total 130 samples, grabbed from this area of the embedding space.
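A minimal sketch of what that selection step is doing, using greedy farthest-point selection over a toy 2-D embedding space. This is a stand-in for illustration only; the ZCore algorithm referenced in the talk is a different, published method.

```python
import math

def coreset_select(embeddings, k):
    """Greedy farthest-point selection: repeatedly pick the unselected
    sample whose distance to its nearest already-selected sample is
    largest. A toy stand-in for zero-shot core-set selection."""
    selected = [0]  # seed with an arbitrary first sample
    while len(selected) < k:
        best, best_d = None, -1.0
        for i, e in enumerate(embeddings):
            if i in selected:
                continue
            # distance from this candidate to its nearest selected point
            d = min(math.dist(e, embeddings[j]) for j in selected)
            if d > best_d:
                best, best_d = i, d
        selected.append(best)
    return selected

# 130 unlabeled samples -> pick a 37-sample "batch v0", as in the demo
points = [(math.cos(i), math.sin(i * 0.7)) for i in range(130)]
batch_v0 = coreset_select(points, 37)
print(len(batch_v0))  # 37
```

In FiftyOne the embeddings would come from a real model via compute_embeddings; the greedy rule here just captures the intuition of preferring samples far from what you have already selected.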
And again, the idea is that this is unlabeled data, so it's potentially not a huge lift for our annotators to go through. FiftyOne will always support exporting to open-source annotation tools like CVAT and others, and we have very deep integrations with third-party annotation tools. But now the platform itself includes (whoops, if I open up a sample) the ability to annotate these directly as well: define a schema and have it all done in-app. I could easily do that here.
All right.
So this becomes our initial batch, and it's from here that we can train our first model. For the sake of time, I'm not going to go through the whole initial model training loop here, but note that this is also well supported in FiftyOne. Let me go to the next step in the tutorial, actually to step five.
Step four here is annotation QA; again, you can create cuboids and 3D annotations in the app as well.
What you can then do is train your first model. The tutorial example has us fine-tuning a YOLOv8 model, and from there (let me go back, oh sorry, let me go back to here) we have a view that we might call our evaluation set. It has 30 samples, and we can see there are now predictions on it. This is our first pass. It's kind of hard to see, since these are in gray here, but these are predictions overlaid with the ground truth.
Good question here in the chat: does ZCore work the same way as FiftyOne's compute_uniqueness function? It's not the exact same algorithm, but it is another uniqueness-related method. I'd need to direct you to the actual paper for the deep details on how it works; the blog I'll share as a resource in a bit links to a post focused on ZCore. It's not exactly the same as the k-nearest-neighbor approach compute_uniqueness uses (there are some differences), but it is related. compute_uniqueness is still very powerful as well: it lets you sort your data by uniqueness and find the most unique samples for later labeling. Zero-shot core-set selection is slightly different but related.
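To make the distinction concrete, here is a toy k-nearest-neighbor uniqueness score in plain Python. It mirrors the spirit of compute_uniqueness (which, per the talk, is k-NN based), not its exact implementation, and ZCore is a different algorithm entirely.

```python
import math

def uniqueness_scores(embeddings, k=3):
    """Score each sample by the mean distance to its k nearest
    neighbors in embedding space; higher means more unique."""
    scores = []
    for i, e in enumerate(embeddings):
        dists = sorted(
            math.dist(e, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# a tight cluster plus one outlier: the outlier scores highest
emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = uniqueness_scores(emb, k=2)
print(scores.index(max(scores)))  # 3 -> the outlier
```

Sorting a dataset by a score like this surfaces the rare samples worth labeling first, which is exactly how compute_uniqueness is used in the curation flow.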
Okay. So now I have this validation set with predictions overlaid.
Let me go to explore here. Sorry, let me go back for a moment. Here's the left view. For some reason this is showing the PCD; give me one second. That's why. And this is still showing the PCD; I don't know why. Give me one last refresh here.
Ah, type error, failed to fetch. What's going on? This site can't be reached. Great. One second. I think my localhost connection just died. I can flip over to another one and show you another example while we're waiting.
In fact, while that's happening and I'm relaunching the FiftyOne app (I think there are some gremlins in the demo system), a quick interlude. Remember I mentioned that human annotation is obviously critical; we are definitely going to need to rely on experts to annotate data. At the same time, foundation models are increasingly supported as well. FiftyOne includes auto-labeling via foundation models as a native function in FiftyOne Enterprise. The reason it's in Enterprise and not natively in open source is that it requires what are called delegated operations, which are ways of delegating work to attached GPUs. The way it works in practice in your flow is: you can obviously open up a sample and annotate it manually, but on top of that you can go to this auto-labeling panel and choose a target set of samples, what type of labeling you want a foundation model to perform, the classes you want to label in your dataset, and the minimum confidence threshold, and then within the app you can apply your labels there as well. So when I say we're performing an annotation loop, I really mean human-in-the-loop annotation with potentially varying levels of automation.
Okay, let me see if I can reconnect to my system here.
All right. I think there was another question: when do you intend to support segmentation masks within the built-in annotation? Segmentation is supported within auto-labeling. In terms of the in-app label editor, it's a roadmap item, something we eventually intend to support. Right now, brand new, it's classification and object detection, but segmentation is definitely on the roadmap; I don't have a firm date at this time, but it's something we're actively developing and prioritizing.
All right. So, looking at this validation set, we can see we've applied predictions and ground truth.
From here, we now want to get our initial set of model metrics; this is our first iteration. It depends on which version of FiftyOne you're using: in the Enterprise version, there's a built-in in-app method for performing model evaluation; in open-source FiftyOne, there's an SDK method, just like everything else shown here, where you literally run evaluate_detections (or a similar method, depending on whether you're doing classifications or detections) and it computes critical model metrics with respect to your ground truth. Now again, in our case this was a sample set of 130 samples of a particular slice, and I fine-tuned a YOLO model on commodity hardware, so I would not expect the model to be super high performing in this first iteration. And if I look at my initial metrics, that's correct: it is not super high performing whatsoever. At the same time, we can start to get an initial sense of where the model is performing and where there is class confusion. I would particularly suggest that false negatives and class confusion are where you'd want to start, and this is where we can begin to approach our 30/70 split from here on out. Looking at the embedding space, maybe we can find a lot of instances with false negatives or false positives on certain things. So I might go into the embedding space and ask: where are nighttime or pedestrian samples clustered, and are there false negatives there? So, I might go back here.
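The false-negative mining idea can be sketched with a minimal IoU matcher. FiftyOne's evaluate_detections computes full metrics; this toy only counts ground-truth boxes that no prediction matched, which are the cases worth hunting in the embedding space.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def count_false_negatives(ground_truth, predictions, thresh=0.5):
    """Ground-truth boxes with no prediction above the IoU threshold
    are false negatives: gaps in the training distribution."""
    return sum(
        1 for gt in ground_truth
        if not any(iou(gt, p) >= thresh for p in predictions)
    )

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10)]  # overlaps the first box only
print(count_false_negatives(gt, preds))  # 1
```

The samples contributing false negatives are the ones whose neighborhoods in embedding space you then mine for the next batch.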
Rather than coloring by core-set selection, let me actually go back to my pool. I might color by, oh, you know what, I would have needed to add the ground truth label to my embeddings computation, and I didn't do that here. You actually do this in the tutorial; I just didn't do it here, I apologize. But it's the same idea: it would highlight by class, and then you might select clusters and choose additional samples from there. And then it just becomes an iterative loop, right? You have your pool, you might augment that pool further from a data lake using a tool like Data Lens, and then you go through the loop: annotate a few key samples, train a model based off those annotated samples, run model evaluation metrics, find where your model is performing poorly in terms of coarse statistics or class confusion, and then identify the types of samples on which the model isn't performing well. Highlight those in the embedding space, tag them for your next batch, and go from there.
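Assembling the next batch with roughly 30% coverage and 70% targeting can be sketched like this. The sample IDs and the failure-similarity list are hypothetical placeholders; in practice, the "failure-like" list would come from similarity search around false-negative clusters in embedding space.

```python
import random

def next_batch(pool, failure_like, batch_size, coverage_frac=0.3):
    """Build the next annotation batch: ~30% coverage (diverse samples
    from the whole pool) and ~70% targeting (samples similar to
    known model failures)."""
    n_cov = round(batch_size * coverage_frac)
    n_tgt = batch_size - n_cov
    targeting = failure_like[:n_tgt]  # nearest-to-failure first
    remaining = [s for s in pool if s not in targeting]
    coverage = random.sample(remaining, n_cov)
    return coverage + targeting

pool = [f"sample_{i}" for i in range(100)]
failures = [f"sample_{i}" for i in range(40, 80)]  # mined near failure clusters
batch = next_batch(pool, failures, batch_size=20)
print(len(batch))  # 20
```

Pure random sampling would mostly return common daytime-style samples; splitting the budget this way keeps diversity while spending most labels where the model fails.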
So where does this ultimately take us? A couple of places. Let me quickly go back to my slide deck here.
First, we need to take note of a couple of failure modes. One thing I took for granted is that our training and test sets are properly segregated: you need a frozen test set, otherwise you can't trust any of the work you're doing. Second, be very careful about label drift over time. Gatekeep your schema heavily; make sure labelers aren't adding new labels that weren't originally addressed by the model. Make sure you include a golden set of data that you know is 100% correct in its ground truth. Don't chase only edge cases; in some cases that can result in the model overfitting toward those edge cases. That divided approach, roughly a third toward coverage and two-thirds toward model failure modes, is important, and when you have that tight loop you can easily stay close to whatever ratio you choose.
Then, from a QA bottleneck standpoint: anyone who writes software or trains models knows that eventually you need to ship, often fast, and you always need to iterate. Creating focused views in FiftyOne with the in-app annotation fixes we now support is the way to go here, because very often exporting to a third-party annotation tool requires an entirely different pipeline, whereas here we can keep everything compact within the platform.
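The golden-set check mentioned above can be sketched minimally: compare each iteration's annotations against a trusted golden set and watch the agreement rate across iterations. The field names and labels here are illustrative, not a FiftyOne schema.

```python
def golden_set_agreement(golden, annotations):
    """Fraction of golden-set samples whose annotator label matches
    the trusted ground truth. A declining rate across iterations is
    a signal of label drift."""
    matches = sum(
        1 for sample_id, label in golden.items()
        if annotations.get(sample_id) == label
    )
    return matches / len(golden)

golden = {"s1": "car", "s2": "truck", "s3": "pedestrian"}
batch_labels = {"s1": "car", "s2": "vehicle", "s3": "pedestrian"}
print(round(golden_set_agreement(golden, batch_labels), 3))  # 0.667
```

Because the golden set is frozen and known-correct, any drop in this number points at drifting guidelines or rogue labels rather than at the data itself.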
So, before I move forward, another question I see in chat: what types of multimodal data can Voxel51 work with, and how do you best annotate them? For example, 3D molecule data, descriptions, pictures. From a general data support perspective, FiftyOne supports 2D images, LiDAR data, and 3D point cloud data of any type, and we support annotation of both 2D and 3D data: classification, detection, and cuboids are our initial pass at annotation support, and the goal is to very quickly increase it from there. From a general data support standpoint (curation, embeddings, training models), it's virtually any type of 2D and 3D data. We also support video, as well as video frames naturally since we support 2D images, plus audio and audio spectrograms, and we're steadily increasing our coverage there. Right now, what's supported in in-app annotation is a subset of the total data modalities FiftyOne supports, but our plan is very much for that subset of annotation-supported modalities to reach the total superset of modalities we support across all parts of the app.
All right.
So, what are a couple of key takeaways? I mentioned this already, and then I'll leave us with a couple of lessons learned and where to go from here. Like I said before, everything I've discussed is available within the open-source app, with the exception of the auto-labeling panel I briefly opened; technically the Data Lens panel is Enterprise-only as well. But curation, annotation, model training, and model evaluation are all supported natively. Really, if there's one thing that's most critical of everything I discussed, aside from the in-app annotation, it's the embeddings engine. Using embeddings is critical for having a quantitative, and then qualitative, visualization of the embedding space, where you get a bird's-eye view of your data distribution, and then using zero-shot core-set selection to color the embedding space by high-value samples. It's worth noting that ZCore is really just a mathematical algorithm; this GitHub repo in the Voxel51 org is a useful way to get started, and the tutorial actually has you run core-set selection on the data within FiftyOne. Right now it's not just a button you press; it's a method, an algorithm you actually run with the SDK. There is work on getting native plug-in support for it, though, so it can be easily run through the app UI as well.
Then, add targeting. As you train and evaluate models, evaluate failure modes based on core model metrics and class confusion, and update your next batch with additional samples to label, which you can again label natively in the app with manual annotation or, if you're a FiftyOne Enterprise customer, with auto-labeling. Approach that 30/70 balance, and QA every iteration: maintain that golden dataset of samples you know are 100% correct to catch any type of data drift, and enforce a common schema so annotators can't go rogue with the types of labels they're producing. If you noticed what I was showing within the in-app annotation, you saw that you can enforce radio buttons and drop-down selection to make it easier on your annotators.
All right. So where to go from here? There are a couple of useful resources that will be helpful. Number one, the conceptual flow of this entire talk was based off of this tutorial; the link is right here, and for convenience I also included a QR code if you're interested.
And if you want more general conceptual details about the way we've incorporated annotation into the platform, whether you're new to FiftyOne or an existing user: this is a new feature in FiftyOne. FiftyOne at its core was a data curation engine for most of its existence, and it still is; that's still our bread and butter. We posit that the best way to get better models is to start with really good data curation, and annotation is meant to provide an engine for that: to make the flow tighter and more iterative, and to make it easier to add, edit, and apply labels to your dataset. This blog post goes into more detail as well, if you'd like to learn more.
That being said, I want to leave the last couple of minutes here for any additional questions folks may have. I saw, I think, Joseph, you had your hand raised; you're free to ask a question if you still have one, either here or in the Q&A, whatever you feel most comfortable with. Happy to field any other questions as well.
Actually, I don't know if folks might be in listen-only mode on the call, so if you're trying to talk and you're muted, you might need to use the Q&A or chat. I'm also looking back through the chat as well.
Yes. Okay, I see some note takers; I think I answered most of these.
Awesome. Another really good resource in general to learn more: if you go to voxel51.com/annotation, that's a landing page to get more context on how we've included annotation in the platform. For just getting started, your best friend is going to be docs.voxel51.com. That's your best friend for installing FiftyOne and getting started with both the open-source and enterprise versions of the software. There's a whole host of getting started guides, industry-specific guides, and tutorials, and they're all in self-contained notebooks. So if you want to download and run them yourself, look at the GitHub source; this is all meant to be very contained. They almost always start with installing FiftyOne from scratch if you haven't yet, so it's meant to be easy for you to get going.
Yeah, so a couple of good questions about embeddings. You'll notice when you look at the FiftyOne methods that the model you use for computing embeddings is configurable. So here is, hold on, compute_visualization. Let's see, torchvision. For some reason I might be missing where the method is here, but literally the method is called compute_embeddings, and then you compute the visualization based off those embeddings. You can pass in whatever model you choose; it is not fixed in FiftyOne. FiftyOne is extremely open and integrable in that respect, so you are not tied into any particular architecture.
Next question: how to annotate multimodal data consisting of 3D molecular data and tissue pictures. The 3D annotation, I think, should for the most part support that. It's going to depend on exactly what your modality looks like. Again, we support classification and detection for both 2D and 3D data, including 3D point clouds. When you're annotating, say, molecular data, it would depend on your exact workflow: there could be some classification involved, and 3D cuboids are supported as well. If you look at the dataset zoo, there might be some healthcare-related datasets you can look through, and we also have full Hugging Face integration. So, Joseph, my recommendation is to parse through the zoo, look for some healthcare-related datasets (they're very easy to load if they're on Hugging Face), and see if any map to your needs and what they would look like within the annotation feature.
Okay: can I use my own embedding model for semantic search, and where will the model be hosted for creating a feature vector from the search term? There are a couple of approaches. When you use FiftyOne for any type of searchability, computing uniqueness, or computing what we call mistakenness on the dataset, for any of these embedding-based methods the flow is: you compute embeddings, you literally pass in the model, which in local FiftyOne would be hosted locally and would need to run on local GPUs, and then you pass the result into the FiftyOne method for what we call similarity search, or uniqueness, or any of the others. That's an interesting question; I might actually follow up with you on it so I can give a better answer, because I might be slightly misinterpreting what you're asking. But yes, when you're using local FiftyOne, the model is almost always hosted locally and runs on local GPUs, and you can use any arbitrary model for computing embeddings.
All right, next question: I'm interested in taking a snapshot for taking the top three relations based on a codebase. I'm not fully understanding your question. Ratatin, sorry if I mispronounced your name: FiftyOne fully supports dataset versioning as an Enterprise feature. I'm not exactly sure if that's what you're referring to, like snapshotting a dataset. In FiftyOne Enterprise there is native in-app support for creating basically a git-style version of the dataset. You can naturally use other open-source tools like DVC for dataset versioning, or, as a somewhat hacky approach in open-source FiftyOne, you can create separate dataset views and then export them using an SDK method to create separate datasets. I'm not exactly sure if that's what you're asking, so I apologize, but there are multiple methods by which you can version and snapshot existing views or even entire clones of FiftyOne datasets.
All right.
So, generally, yeah: annotation drift, to the last question here, is a broad term that really refers to changes in labeling guidelines, which really means changes in schema, that cause labels to become less reliable over time because the criteria for labeling in the first place have changed. That can take a couple of forms. Number one, the label names themselves might change or drift over time. Maybe it was previously just a generic term like "vehicle" and over time became "car" and "truck", or what counts as a car or a truck changed as the annotation guidelines changed. Now you have a mixed dataset of vehicle, car, and truck, and the model progressively becomes confused: the language is being changed, and therefore the model's understanding will change. Another way annotation drift can happen is that the criteria for how you do the labeling change. Maybe the tightness of a bounding box changes over time, going from a tight fit to a looser one with more wiggle room. Or in edge cases, like blurry images or dusk and dawn photos, what's considered "night" versus "day" might slightly shift as different personnel do the labeling. That's really what we mean by annotation drift: the model's objective understanding of what is being labeled changes over time because what is considered objective ground truth has slightly changed as well. Now, in some cases that can actually make the model more robust, if it's able to build a better understanding from multiple edge cases labeled in different ways; it can actually become smarter in that case. But on the other side, it can also become confused if there are very clear cases where two labels on very similar objects are shown to be two different things based on changes in annotation guidelines.
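Gatekeeping the schema, the first defense against this kind of drift, can be as simple as flagging any label outside the agreed set. A toy sketch, with illustrative label names:

```python
def flag_schema_drift(labels, allowed):
    """Return any labels that fall outside the agreed schema, e.g.
    annotators introducing 'car'/'truck' into a dataset whose schema
    only defines 'vehicle'. Catching these early prevents drift."""
    allowed = set(allowed)
    return sorted({label for label in labels if label not in allowed})

seen = ["vehicle", "vehicle", "car", "truck", "vehicle"]
print(flag_schema_drift(seen, ["vehicle", "pedestrian"]))  # ['car', 'truck']
```

The in-app annotation editor's radio buttons and drop-downs enforce this at entry time; a check like this is the batch-level equivalent for labels arriving from elsewhere.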
Yeah, Orin, great question: how would you approach using embeddings when the scene is very busy? Obviously it can cluster day and night, but what if the differences you're looking for are in the objects and not the scene? Very good question. Something we did not directly address in this talk is the idea of patches. Let's see if we actually have it set up here. Do I have a patch view here? I don't quite have one, but there's a method you can run in FiftyOne called to_patches. Notice that each of these samples has multiple bounding boxes in it. What the patches view does is create one patch per object. For example, a lot of these scenes might have a dozen different labels on them; patch views let you create a sample for every label. Then, when you compute embeddings based off of that, the embeddings become more granular, on a per-object basis. So when you're looking at a particular embedding, you're saying: okay, that is the sample that specifically has a car in it. It has other things as well, but there are different samples for those other things because of the patch view we created. So when you're looking at very busy scenes like that, you can create multiple patch views and visualize the embedding space on them. And yes, you're correct: a good starting point is to use a foundation model to create some initial bounding boxes, then create patch views of those boxes. That maps one sample per object in the sample grid, and if you then run embeddings on that particular view after the patch conversion, it illustrates that space as well. Great question.
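The idea behind to_patches() can be sketched as flattening image-level samples into one record per labeled object. The dict layout here is illustrative, not FiftyOne's actual document schema:

```python
def to_patches(samples):
    """Flatten image-level samples into one record per labeled object,
    mirroring the idea behind a patches view: per-object records let
    you compute per-object embeddings in busy scenes."""
    patches = []
    for sample in samples:
        for det in sample["detections"]:
            patches.append({
                "sample_id": sample["id"],
                "label": det["label"],
                "bbox": det["bbox"],
            })
    return patches

scene = {
    "id": "img_001",
    "detections": [
        {"label": "car", "bbox": (0.1, 0.2, 0.3, 0.2)},
        {"label": "pedestrian", "bbox": (0.6, 0.5, 0.1, 0.3)},
    ],
}
print(len(to_patches([scene])))  # 2
```

Running embeddings over records like these, rather than over whole images, is what makes object-level clusters separable in a busy scene.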
All right.
Any other questions?
All right. Well, we are at time. Thank you so much for your attendance and participation.
We will follow up with some of the resources that I mentioned at both the beginning and the end. I highly recommend pip install fiftyone; give human annotation a spin, and you can learn more about it in the resources I mentioned. Otherwise, enjoy the rest of your Wednesday. Thanks so much, everyone.