
Depth Anything V2 Monocular Depth Estimation (Explanation and Real Time Demo)

By Kevin Wood | Robotics & AI

Summary

Key takeaways

  • Depth Anything V2: Monocular Depth Estimation: Depth Anything V2 is a new model capable of generating depth maps from a single image, expanding applications in 3D reconstruction, navigation, and AI-generated content. [00:08], [01:28]
  • Synthetic Data: Fine Details & Transparency: Synthetic data offers advantages by correctly labeling fine details and enabling depth estimation for transparent or reflective surfaces, which are challenging with real-world data. [03:32], [03:49]
  • Student-Teacher Model Architecture: Depth Anything V2 uses a student-teacher model, where a larger teacher model trains a lighter student model using pseudo-labels on real images to achieve both accuracy and efficiency. [05:05], [05:54]
  • Handling Reflective and Transparent Surfaces: Depth Anything V2 shows improved performance on reflective surfaces by correctly identifying them as part of the object, and on transparent objects by treating them as solid, unlike previous models. [08:19], [09:03]
  • Fine Detail Capture: The model excels at capturing fine details, such as the structural elements of a bridge or individual keys on a keyboard, which were previously lost or smoothed out in earlier versions. [09:55], [10:08]
  • Real-Time Demo Performance: In a real-time demo using the small model, Depth Anything V2 performed well on objects with reflective rims and transparent containers, though it could still be 'fooled' by objects placed inside them. [10:39], [11:38]

Topics Covered

  • Why real-world labeled data falls short for AI.
  • Synthetic data's untapped potential and current limitations.
  • How the teacher-student model revolutionizes depth estimation.
  • Depth Anything V2 excels at difficult objects.
  • Can Depth Anything V2 deliver real-time performance?

Full Transcript

Depth Anything V2 just came out, and there's a paper as well as a GitHub repo that goes along with it. What this model can do is take a single image and obtain the depth map for you, using only one camera. Previously we've talked about the depth map using a different model called MiDaS, so you could go check out that video for a comparison. In this video we'll be focusing on monocular depth estimation applications, talk about the key performance metrics, go over the problems with real labeled data, go over the advantages and challenges with synthetic data, go over the architecture that Depth Anything V2 uses, which is called a student-teacher model, go into detail on the annotation pipeline, the benchmarks on standard datasets and the DA-2K dataset for Depth Anything V2, go over reflective surfaces, transparent objects, and fine details, and finally we will have a real-time demo using Depth Anything V2, as you can see here on the screen.

[Music]
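As a quick sketch of what "depth from a single image" looks like in code, here is a minimal example, assuming the Hugging Face Transformers depth-estimation pipeline and the community-hosted depth-anything/Depth-Anything-V2-Small-hf checkpoint; the checkpoint name and output format are assumptions, so check the model card or the official repo for the exact API.

```python
# Minimal single-image depth sketch. Assumes the transformers
# "depth-estimation" pipeline and the depth-anything/Depth-Anything-V2-Small-hf
# checkpoint; swap in whichever model id you actually use.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",
)

image = Image.open("example.jpg")           # any RGB image from a single camera
result = depth_estimator(image)             # dict containing a PIL depth visualization
result["depth"].save("example_depth.png")   # relative depth map visualization
```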

Right, so, monocular depth estimation applications. There are a couple of classical applications: things like 3D reconstruction, navigation, and autonomous driving. For more modern applications we're diving into the field of AI-generated content, so things like images, videos, and 3D scenes. Some key performance metrics that this new Depth Anything V2 is trying to hit are things like fine detail, transparent objects, reflections, complex scenes, efficiency, and transferability. These are the different areas where it's trying to excel, and you can see here a comparison between Marigold and Depth Anything V1: usually Marigold is good at certain things, and Depth Anything V1 is good at certain things as well. The main thing Depth Anything V2 is trying to achieve is to be the best of both worlds, so it wants to be good at all of these things while preserving the fine details and doing well on surfaces that are hard to detect. We'll see later on how it actually performs, both in some of the examples I'll show and in real time.

So, the problem with real labeled data. A couple of things we see: usually there's going to be some label noise, due to transparent objects or surfaces that are textureless or repetitive, so sometimes the data ends up noisy when you actually label it. Another thing is missing data: sometimes the very fine details are smoothed out. You can see an example here; the middle is the depth from the real image and the right is from the synthetic one. Specifically, if you look at the grass or flowers here, you can see they're pretty much smoothed out, almost like a Gaussian blur was applied, but on the right it's much more detailed. If we take a look at the tree you see the same behavior: a lot of the detail in the leaves you can't quite see, but on the right, with the synthetic data, you can see a lot more of the details.

So, advantages and challenges with synthetic data. A couple of advantages: first, the fine details are labeled correctly, as you can see here on the right; these are some of the synthetic images, and all the details show up very finely in the depth map. You can also obtain the depth of transparent or reflective surfaces, which is another big advantage, because when you generate the synthetic data the renderer doesn't "know" that a surface is transparent, it just treats it as a solid object. But even though there are advantages, we also have some challenges. One is the gap between synthetic and real images: the color distribution will be different between the two. Another point is the limited types of scenes used; a lot of times in real life the scenes are more complex, and sometimes these complexities can be pretty hard to synthesize in your synthetic data. If you take a look at some of these images here, you can kind of tell that they're synthetic just based on how the colors look; there's just something about fake data that you can tell by the coloring. So there might be some time before pure synthetic data looks real, and hopefully when that gets better you'll see a better improvement.

But here we talk about Depth Anything V2, and specifically we'll go into the model architecture they use, which is a teacher-student model. The main idea of this teacher-student model is that you want to end up with a lighter-weight model that is still accurate, so you can think of the teacher model as the heavier model and the student model as the lighter model. Specifically, Depth Anything is actually based on the DINOv2 model from Meta, so the teacher model uses the big one, the ViT-G variant, and is trained on the synthetic images; you can see that in the first column of the pipeline. Next, the teacher model is used to create pseudo-labels on the unlabeled real images, which is what the middle of the pipeline is doing. Lastly we have the student model, which uses the small DINOv2; the pseudo-labeled real images are used to train the student model, and that model can then be used in our real application later on.
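To make the teacher-to-student flow concrete, here is a heavily simplified, illustrative sketch of the pseudo-label distillation idea in PyTorch. The TinyDepthNet stand-in networks, the sizes, and the L1 loss are placeholders chosen for illustration; the actual Depth Anything V2 training uses DINOv2-based models and its own loss functions.

```python
# Illustrative sketch of the teacher -> pseudo-label -> student idea only; the
# real Depth Anything V2 training uses DINOv2-based depth models and different
# losses, which are not reproduced here.
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Stand-in depth network: maps an RGB image to a 1-channel depth map."""
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

teacher = TinyDepthNet(width=64)   # heavier model, already trained on synthetic data
student = TinyDepthNet(width=16)   # lighter model we actually want to deploy
teacher.eval()                     # teacher is frozen during distillation

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

for step in range(100):
    # In the real pipeline these would be unlabeled real images.
    images = torch.rand(4, 3, 64, 64)

    with torch.no_grad():
        pseudo_depth = teacher(images)   # teacher generates pseudo-labels

    pred = student(images)               # student learns to match the pseudo-labels
    loss = loss_fn(pred, pseudo_depth)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```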

So, the annotation pipeline. The idea is that it first uses SAM, the Segment Anything Model from Meta, to sample different regions, and then it tests four models: Depth Anything V1, V2, Marigold, and GeoWizard. If all four models agree, they resample points; if there's a conflict, a human goes ahead and intervenes and annotates. That's how they set up their data pipeline.
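Here is a toy sketch of that voting logic as described above: four depth models each "vote" on which of two sampled pixels is closer, unanimous pairs get resampled, and conflicting pairs go to a human annotator. The model names, fake depth maps, and the smaller-value-means-closer convention are all hypothetical, illustrative choices, not the authors' pipeline code.

```python
# Toy sketch of the pair-voting idea: unanimous agreement -> resample the pair,
# any disagreement -> send to a human annotator. Everything here (model names,
# fake maps, depth convention) is illustrative, not the paper's implementation.
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]  # (x, y)

def vote_on_pair(depth_maps: Dict[str, List[List[float]]],
                 p1: Pixel, p2: Pixel) -> str:
    """Return 'resample' if every model agrees on which pixel is closer,
    otherwise 'human_annotation'. Assumes a smaller value means closer."""
    votes = set()
    for depth in depth_maps.values():
        d1 = depth[p1[1]][p1[0]]
        d2 = depth[p2[1]][p2[0]]
        votes.add("p1_closer" if d1 < d2 else "p2_closer")
    return "resample" if len(votes) == 1 else "human_annotation"

# Example with tiny fake 2x2 depth maps from the four models mentioned above.
fake_maps = {
    "depth_anything_v1": [[1.0, 2.0], [3.0, 4.0]],
    "depth_anything_v2": [[1.1, 2.1], [3.1, 4.1]],
    "marigold":          [[0.9, 1.9], [2.9, 3.9]],
    "geowizard":         [[1.2, 2.2], [3.2, 4.2]],
}
print(vote_on_pair(fake_maps, (0, 0), (1, 1)))  # all agree -> 'resample'
```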

Then here you can see the benchmark on standard datasets. KITTI is a pretty popular dataset, and these other ones are standard datasets as well. This is the performance comparison between MiDaS, Depth Anything V1, and V2, and you can see that the numbers, at least between V1 and V2, aren't drastically different, while some of the differences between V2 and MiDaS are more significant. The key point here is that they did most of their evaluation on the DA-2K dataset. This one has eight different categories: transparent/reflective, adverse style, aerial, underwater, object, indoor, outdoor, and non-real. Here you can see the comparison between the four models, plus an extra one called DepthFM, against Depth Anything V2, and some of the accuracy percentages on the left are in the 80s while Depth Anything V2 is in the upper 90s. One thing to note is that the biggest model, the giant, is not significantly better than the small one, so you could probably do well in real time using the small model, because you won't lose too much performance; you can see it's pretty similar.
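For reference, the numbers reported on those standard benchmarks are typically the absolute relative error (AbsRel) and the delta-1 accuracy. Below is a minimal sketch of those two conventional metrics; this is the standard definition used across depth benchmarks, shown as an assumption, not code taken from the Depth Anything V2 repo.

```python
# Minimal sketch of the two metrics usually reported on KITTI-style depth
# benchmarks (AbsRel and delta_1); shown for reference only.
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative error over valid (gt > 0) pixels."""
    mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def delta_1(pred: np.ndarray, gt: np.ndarray, thresh: float = 1.25) -> float:
    """Fraction of valid pixels where max(pred/gt, gt/pred) < 1.25."""
    mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < thresh))

# Toy usage with random "depth maps".
gt = np.random.uniform(1.0, 10.0, size=(240, 320))
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)  # pretend prediction
print(f"AbsRel: {abs_rel(pred, gt):.3f}, delta_1: {delta_1(pred, gt):.3f}")
```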

Now let's take a look at how it does on reflective surfaces. Here you can see this table has a lot of reflection: Depth Anything V1 starts treating the reflection as part of the background, when it shouldn't, but if you look closely at Depth Anything V2, it's actually treating the reflective part of the table as part of the table in the depth map, so that's a good sign. Similarly, here with the building, you can see it's a reflective building; Depth Anything V1 actually thinks the reflection is another building, but V2 treats it as just a reflection, so it's not seeing another building.

Now here are some examples of transparent objects. Up on the top row there are these containers holding candies, and you can see that Depth Anything V1 tends to look inside the containers, whereas V2 treats the containers as whole objects. Then on the bottom there are some containers on a table; V1 tends to see some parts inside, which is not desirable, but Depth Anything V2 treats the surface completely as a solid object. Overall, Depth Anything V1 tends to look inside the transparent surface, whereas V2 sees it as an object instead of looking inside.

And here are some examples of fine details using Depth Anything V2. The first row is of a bridge, and you can see that Depth Anything V2 captures a lot of the details of the bridge, these truss members here, a lot of which is lost in V1. Then on the second row, with this wheel, a cartwheel, I forgot what you call it, you can see there's a lot of detail showing, and also in the buildings: a lot of these beams here, there's a lot more detail here as well.

Okay, so next up we're going to see a real-time demo. You can see here we have the feed from my webcam on the left, and on the right is the actual depth map, so we can test out different objects. Here we have a cup; notice that I'm using the small model here, and it's a little bit laggy; it's probably going to have different performance based on what computer you're using.
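A rough sketch of a webcam loop like this demo is below, assuming OpenCV for capture and display and reusing the (assumed) Hugging Face pipeline and small checkpoint from the earlier sketch; it is not the author's actual script, which is linked from his website.

```python
# Rough webcam depth-map loop sketch (not the author's exact demo script).
import cv2
import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint id
)

cap = cv2.VideoCapture(0)                      # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # BGR (OpenCV) -> RGB (PIL) for the model.
    rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    depth = np.array(depth_estimator(rgb)["depth"])

    # Normalize to 0-255 and apply a colormap so it looks like the demo view.
    depth_vis = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    depth_vis = cv2.applyColorMap(depth_vis, cv2.COLORMAP_INFERNO)

    cv2.imshow("webcam", frame)
    cv2.imshow("depth", depth_vis)
    if cv2.waitKey(1) & 0xFF == ord("q"):      # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```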

If you take a close look at my cup, the rim is actually reflective, and you can see that even though there's reflection, it doesn't really affect the performance on the cup. Here I'm going to move it back a little bit so you can see how it looks; again it's doing pretty well, it's not treating that reflection as anything different, so that's a good sign.

And here let's take a look at a transparent object. This is a container I have from Mellow Mow, a pretty good dessert place if you haven't tried it. You can see that even though it's transparent (you can see my thumb on the left), it treats this as an entire object, so it doesn't really care about what it sees inside, which is pretty good. Now if I try opening the container and put a finger inside, you can kind of see something going on: it kind of sees the finger inside in the depth map. When you look at it this way, maybe it's not quite what you expected, so you still can kind of fool it, and in this specific application you may need extra training data.

Now let's take a look at something detailed, our keyboard. The keyboard has a lot of detail, and you can see that if I move it really close, you can kind of see the individual keys, which is pretty good. Now let's see how it looks when we start moving it back: if I move it way back, it starts to lose a little bit of the detail, and you can't really see the individual keys anymore, maybe very faintly if you look very closely. So you have to be pretty close to see all the fine details, but overall I would say Depth Anything V2 is much better than the MiDaS test I did a while ago. If you want the code for this video, go ahead and check out my website; I'll put a link in a pinned comment below. If you found this video helpful, give it a like and subscribe, and I'll see you in the next one.

[Music]
