Depth Anything V2 Monocular Depth Estimation (Explanation and Real-Time Demo)
By Kevin Wood | Robotics & AI
Summary
## Key takeaways
- **Depth Anything V2: Monocular Depth Estimation**: Depth Anything V2 is a new model capable of generating depth maps from a single image, expanding applications in 3D reconstruction, navigation, and AI-generated content. [00:08], [01:28]
- **Synthetic Data: Fine Details & Transparency**: Synthetic data offers advantages by correctly labeling fine details and enabling depth estimation for transparent or reflective surfaces, which are challenging with real-world data. [03:32], [03:49]
- **Student-Teacher Model Architecture**: Depth Anything V2 uses a student-teacher setup, where a larger teacher model trains a lighter student model with pseudo-labels on real images to achieve both accuracy and efficiency. [05:05], [05:54]
- **Handling Reflective and Transparent Surfaces**: Depth Anything V2 shows improved performance on reflective surfaces by correctly identifying them as part of the object, and on transparent objects by treating them as solid, unlike previous models. [08:19], [09:03]
- **Fine Detail Capture**: The model excels at capturing fine details, such as the structural elements of a bridge or individual keys on a keyboard, which were previously lost or smoothed out in earlier versions. [09:55], [10:08]
- **Real-Time Demo Performance**: In a real-time demo using the small model, Depth Anything V2 performed well on objects with reflective rims and transparent containers, though it could still be "fooled" by objects placed inside them. [10:39], [11:38]
Topics Covered
- Why real-world labeled data falls short for AI.
- Synthetic data's untapped potential and current limitations.
- How the teacher-student model revolutionizes depth estimation.
- Depth Anything V2 excels at difficult objects.
- Can Depth Anything V2 deliver real-time performance?
Full Transcript
Depth Anything V2 just came out, and there's a paper as well as a GitHub repo that goes along with it. What this model can do is take a single image and obtain the depth map for you, using only one camera. Previously we've talked about depth maps using a different model called MiDaS, so you can check out that video for a comparison. In this video we'll focus on monocular depth estimation applications, talk about the key performance metrics, go over the problems with real labeled data, go over the advantages and challenges of synthetic data, and go over the architecture that Depth Anything V2 uses, which is called a student-teacher model. We'll go into detail on the annotation pipeline, the benchmarks on standard datasets, the DA-2K dataset for Depth Anything V2, reflective surfaces, transparent objects, and fine details, and finally we'll have a real-time demo using Depth Anything V2, as you can see here on the right.
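If you just want to try the single-image case yourself, here is a minimal sketch of how that could look. It assumes the released checkpoints are available through the Hugging Face `depth-estimation` pipeline under an ID like `depth-anything/Depth-Anything-V2-Small-hf`; check the official GitHub repo for the exact model names and loading code.

```python
# Minimal sketch (assumed model ID and pipeline usage): one RGB image in, one depth map out.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # small variant; swap for a larger one if you like
)

image = Image.open("example.jpg")      # any single image from one ordinary camera
result = depth_estimator(image)        # monocular depth inference

result["depth"].save("example_depth.png")   # "depth" is a PIL image of relative depth
```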
So, monocular depth estimation applications. There are a couple of classical applications: things like 3D reconstruction, navigation, and autonomous driving. For more modern applications, we're diving into the field of AI-generated content, so things like images, videos, and 3D scenes. Some key performance metrics that this new Depth Anything V2 is trying to hit are things like fine detail, transparent objects, reflections, complex scenes, efficiency, and transferability. These are the areas it's trying to excel in, and you can see here a comparison between Marigold and Depth Anything V1: usually Marigold is good at certain things, and Depth Anything V1 is good at certain other things. The main thing Depth Anything V2 is trying to achieve is to be the best of both worlds: it wants to do well at all of these while preserving fine details and handling surfaces that are hard to detect. We'll see later on how it actually performs, both in some of the examples shown here and in real time.
So, the problem with real labeled data. A couple of things we see: usually there's going to be some label noise, because transparent objects or textureless, repetitive surfaces make the labels noisy when you actually annotate the data. Another thing is missing data; sometimes the very fine details are smoothed out. You can see an example here: the middle is the real image and the right is the synthetic one. If you look at the grass or the flowers, they're pretty much smoothed out, almost as if a Gaussian blur were applied, but on the synthetic side it's much more detailed. If you look at the tree you see the same behavior: a lot of the detail in the leaves is lost, but on the right, with the synthetic data, you can see a lot more of the details.
So, advantages and challenges with synthetic data. A couple of advantages: the fine details are labeled correctly, as you can see on the right, where the synthetic images have all the details rendered very finely in the depth maps, and you can also obtain depth for transparent or reflective surfaces. That's another big advantage, because when you generate synthetic data the depth label doesn't depend on the surface being transparent; it's just treated as a solid object. But even with those advantages, there are challenges. One is that synthetic images can be too photorealistic, and the color distribution differs between real and synthetic data. Another is the limited types of scenes used: in real life scenes can be more complex, and those complexities can be pretty hard to synthesize. If you look at some of these images, you can kind of tell they're synthetic just from how the colors look; there's something about fake data that gives it away. So it might be some time before purely synthetic data looks real, and hopefully as that improves you'll see a bigger improvement in the models too.

Now let's talk about Depth Anything V2 specifically and go into the model architecture they use, which is a teacher-student model.
The main idea of the teacher-student model is that you want to end up with a lighter-weight model that is still accurate. Usually the teacher is a heavier model and the student is a lighter model. Specifically, Depth Anything is based on the DINOv2 model from Meta. The teacher model uses the biggest one, the G (giant) variant, and is trained on the synthetic images; that's what's happening in the first column here. Next, the teacher model is used to create pseudo-labels on the unlabeled real images, which is what's happening in the middle of the pipeline. Lastly, we have the student model, which uses the small DINOv2: training it on the pseudo-labeled real images gives us the model that can be used in our real application later on.
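That three-step flow is easy to see in code. Below is a toy sketch of the idea only, not the paper's actual training recipe: a frozen "heavy" teacher produces pseudo depth labels on unlabeled images, and a lighter student is trained to match them. Tiny stand-in CNNs and random tensors replace the DINOv2-based networks and real image batches, and a plain L1 loss stands in for whatever losses the paper actually uses.

```python
# Toy teacher-student distillation sketch (illustration only, not Depth Anything V2's exact recipe).
import torch
import torch.nn as nn

def tiny_depth_net(width: int) -> nn.Module:
    # Stand-in for a real depth network: 3-channel image in, 1-channel depth map out.
    return nn.Sequential(
        nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, 1, 3, padding=1),
    )

teacher = tiny_depth_net(64).eval()   # "heavy" model, already trained on synthetic data
student = tiny_depth_net(16)          # "light" model we actually want to deploy
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

unlabeled_real_batches = [torch.rand(4, 3, 64, 64) for _ in range(10)]  # fake "real" images

for batch in unlabeled_real_batches:
    with torch.no_grad():
        pseudo_depth = teacher(batch)     # step 2: pseudo-labels on unlabeled real images
    pred_depth = student(batch)           # step 3: train the student on those pseudo-labels
    loss = loss_fn(pred_depth, pseudo_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```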
So, the annotation pipeline. The idea is that it first uses SAM, the Segment Anything Model from Meta, to sample the different regions, and then it tests on four models: Depth Anything V1, Depth Anything V2, Marigold, and GeoWizard. If all four match, they resample points; if there's some problem, a human intervenes and annotates. That's how they set up their data pipeline.
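The agree-or-escalate logic is simple enough to sketch. Here each "model" is reduced to a lookup of which of two sampled points it thinks is closer; how SAM proposes the regions and how the four depth models are actually run is assumed and omitted.

```python
# Sketch of the DA-2K-style routing rule described above: if all four models agree the
# pair is easy, resample a new pair; if they disagree, send the pair to a human annotator.
from typing import Dict, List, Tuple

Point = Tuple[int, int]          # (row, col) of a SAM-sampled pixel

def closer_point(depths: Dict[Point, float], a: Point, b: Point) -> Point:
    # Convention assumed here: a smaller depth value means closer to the camera.
    return a if depths[a] < depths[b] else b

def route_pair(model_depths: List[Dict[Point, float]], a: Point, b: Point) -> str:
    votes = {closer_point(d, a, b) for d in model_depths}
    return "resample" if len(votes) == 1 else "human_annotation"

# Toy usage: three models agree, one disagrees -> the pair goes to a human.
a, b = (10, 20), (30, 40)
predictions = [
    {a: 1.2, b: 3.4},   # Depth Anything V1 (stand-in values)
    {a: 1.0, b: 2.9},   # Depth Anything V2
    {a: 0.8, b: 3.1},   # Marigold
    {a: 2.5, b: 2.0},   # GeoWizard disagrees
]
print(route_pair(predictions, a, b))   # -> "human_annotation"
```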
Then here you can see the benchmarks on standard datasets. KITTI is a pretty popular dataset, and the other ones are standard datasets as well. This is the performance comparison between MiDaS, Depth Anything V1, and Depth Anything V2, and at least between V1 and V2 the numbers aren't drastically different, while some of the differences between V2 and MiDaS are more significant. The key point is that they did most of their evaluation on the DA-2K dataset. It has eight different categories: transparent/reflective, adverse style, aerial, underwater, object, indoor, outdoor, and non-real. Here you can see the comparison between the four models, plus an extra one called DepthFM, against Depth Anything V2. Some of the accuracy percentages on the left are in the 80s, while Depth Anything V2 is in the upper 90s. One thing to note is that the biggest model, the giant, is not significantly better than the small one, so you could probably get away with using the small model for real time without losing too much performance; you can see it's pretty similar.
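For context on what those accuracy percentages mean: as I understand it, DA-2K is annotated as sparse pixel pairs where the label says which of the two points is closer, and accuracy is the fraction of pairs the predicted depth map orders correctly. A minimal sketch under that assumption:

```python
# Sketch of a DA-2K-style pairwise relative-depth accuracy metric (assumed formulation).
import numpy as np

def pairwise_accuracy(depth_map: np.ndarray, pairs) -> float:
    """pairs: iterable of ((y0, x0), (y1, x1), closer_index) with closer_index in {0, 1}."""
    correct = 0
    for p0, p1, closer_index in pairs:
        d0, d1 = depth_map[p0], depth_map[p1]
        predicted_closer = 0 if d0 < d1 else 1   # assumes smaller value = closer; flip for inverse depth
        correct += int(predicted_closer == closer_index)
    return correct / len(pairs)

# Toy usage with a random "depth map" and two hand-made annotations.
depth = np.random.rand(480, 640)
annotations = [((100, 200), (300, 400), 0), ((50, 60), (70, 80), 1)]
print(pairwise_accuracy(depth, annotations))
```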
Now let's take a look at how it does on reflective surfaces. Here you can see that this table has a lot of reflection. Depth Anything V1 starts treating the reflection as part of the background, which it shouldn't, but if you look closely at Depth Anything V2, it actually treats the reflective part of the table as part of the table in the depth map, which is a good sign. Similarly with the building here: it's a reflective building, and Depth Anything V1 actually thinks the reflection is another building, while V2 treats it as just a reflection, so it's not seeing another building.
And now here are some examples of transparent objects. On the top row there are containers holding candies, and you can see that Depth Anything V1 tends to look inside the containers, whereas V2 treats them as whole objects. On the bottom row there are some containers on a table; V1 again tends to see parts of what's inside, which is not desirable, but Depth Anything V2 treats the surface as a solid object. Overall, Depth Anything V1 tends to look inside transparent surfaces, whereas V2 sees them as single objects instead of looking inside.
Here are some examples of fine details with Depth Anything V2. The first row is a bridge, and you can see that V2 captures a lot of the details, like these truss members, much of which is lost in V1. On the second row, with this wheel (a cartwheel, I forget exactly what it's called), there's a lot of detail showing, and also in the buildings: a lot of these beams come through with much more detail here as well.
Okay, next up we're going to see a real-time demo. On the left is the feed from my webcam, and on the right is the actual depth map, so we can test out different objects. Here we have a cup. Notice that I'm using the small model here and it's a little bit laggy; it will probably perform differently depending on what computer you're using.
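The actual code for this demo is linked below from my website, but a minimal sketch of the same idea might look like the following: grab webcam frames with OpenCV, run the small checkpoint through the Hugging Face `depth-estimation` pipeline (the model ID here is an assumption), and show the frame and the colorized depth map side by side.

```python
# Minimal real-time sketch: webcam feed on the left, depth map on the right.
import cv2
import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation",
                           model="depth-anything/Depth-Anything-V2-Small-hf")  # assumed checkpoint ID

cap = cv2.VideoCapture(0)                                  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    depth = np.array(depth_estimator(rgb)["depth"])        # relative depth, same size as the frame
    depth = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    depth_vis = cv2.applyColorMap(depth, cv2.COLORMAP_INFERNO)
    cv2.imshow("webcam | depth", np.hstack([frame, depth_vis]))
    if cv2.waitKey(1) & 0xFF == ord("q"):                  # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```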
If you take a close look at my cup, the rim is actually reflective, and you can see that even with the reflection it doesn't really affect how the cup shows up. I'll move it back a little bit so you can see how it looks; again, it's doing pretty well and isn't treating that reflection as anything different, so that's a good sign.
Now let's take a look at a transparent object. Here is a container I have from Mellow Mow, a pretty good dessert place if you haven't tried it. You can see that even though it's transparent (you can see my thumb on the left), it's treated as one whole object; the model doesn't really care about what it sees inside, which is pretty good. Now if I open the container and put a finger inside, you can kind of see something going on: the depth map starts to pick up the finger inside. Looking at it this way, maybe that's not quite what you expected, so you can still kind of fool it; for this specific application you might need extra training data.
Now let's take a look at something detailed: our keyboard. The keyboard has a lot of detail, and if I move it really close you can kind of see the individual keys, which is pretty good. Now let's see how it looks when we start moving it back. If I move it way back, it starts to lose a little bit of detail, and you can't really make out the individual keys anymore, maybe only very faintly if you look closely. So you have to be pretty close to see all the fine details, but overall I would say Depth Anything V2 is much better than the MiDaS test I did a while ago. If you want the code for this video, check out my website; I'll put a link in the pinned comments below. If you found this video helpful, give it a like and subscribe, and I'll see you in the next one.