SMPL in Mixed Reality at Microsoft
By Microsoft Research
Summary
## Key takeaways
- **Microsoft Mesh: Shared Presence in Mixed Reality**: Microsoft Mesh enables shared presence and experiences through mixed reality, allowing users to interact as if they are in the same room, regardless of physical distance. [00:17]
- **SMPL Drives Realistic Avatars in Mixed Reality**: The SMPL body model is used to create avatars that represent users as faithfully as possible in Microsoft Mesh, driven by HoloLens head and hand tracking data without external cameras. [03:08], [03:30]
- **Overcoming Sparse Data with SMPL Prior**: SMPL acts as a strong prior to solve the under-constrained problem of inferring full body pose from limited HoloLens head and hand tracking signals, making real-time avatar animation feasible. [04:21], [04:43]
- **Efficient Cloud Service for Real-Time Avatars**: Optimized SMPL fitting, adapted from efficient in-house trackers, enables a cloud service that runs cheaply on low-spec machines and supports numerous users simultaneously for telepresence. [05:15], [05:50]
- **Synthetic Data Generation for Hand Tracking**: Microsoft uses SMPL to generate realistic synthetic training data for hand tracking, overcoming limitations of previous pipelines that only modeled hands up to the forearm. [07:16], [08:14]
- **SMPL Shape Space as a Data Multiplier**: SMPL's shape space acts as a data multiplier, allowing a single captured pose sequence to be rendered with diverse body shapes, significantly increasing the variety of synthetic training data. [09:45]
Topics Covered
- Limited Data, Full Avatars: Overcoming Under-Constrained Pose Estimation
- Cloud-Powered Telepresence: Enabling Shared Mixed Reality Experiences
- Parametric Models: Foundation for Diverse Synthetic Data
- Realistic Avatars: From Naked Geometry to Clothed Digital Humans
- Synthetic Data: Training Robust AI Without Real Images
Full Transcript
Hi, I'm Errol, a scientist at Microsoft in Cambridge, UK, and I'm delighted to talk to you today a little bit about how we use SMPL in mixed reality projects, both in research and in products. To set the stage, I'd like to start by showing you a short video to introduce Microsoft Mesh, a new service we have that enables shared presence and shared experiences through mixed reality.
Connection is a spark that gives our lives meaning. [Music] It drives us to seek out others who feel the same way. "Okay, why don't you input the data and we'll take a look together." "Hey Mario, what you got for me?" ...to find those who share our views yet offer different perspectives. "Look over here." ...challenge us with new ways of seeing. "But I think we should..." ...deepen our understanding [Music] and enrich our lives. [Music] "So, for you..." Great things happen when we commit to something bigger than ourselves. "Take a closer look at it." "Place this here. Let's see how we go from there." "Okay." This sense of collaboration, and the feelings of connection it brings, excites us. "Hey, just in time." "I'm gonna move it slightly. Okay, it's yours, take it." "We have two planes right now on the same trajectory." As we put people first, technology fades into the background and feels like anything but. "Aisha, what do you think?" "I think if we had 330, maintaining 2,800, we'll be clear for approach." This changes the way we see the world, and in turn changes the world we see. "These numbers are looking great, actually." There's promise in the possibilities, and what we see and create next will stretch the imagination. "Good morning, Sarah." "Morning." "Slowly coming towards the thumb..." ...a world without... "A lot better than boundaries, yeah." "Excellent." ...slowly bringing the world where technology enhances, not limits, humanity. With people front, center and in the spotlight, the future is here, and here can be anywhere. Introducing Microsoft Mesh. [Music]
Wow, what a video. We covered a lot of different ways to share experiences in mixed reality. One of these ways is with avatars, these 3D models of people. Our goal is that these avatars represent the user as faithfully as possible, and for that we use a motion model which uses SMPL. Now, that video was awesome, but it also included quite a bit of smoke and mirrors, so what does the real thing look like? Let's look at what's going on under the hood. On the left you can see one of my colleagues, Tom, and on the right you can see his avatar, a SMPL body which is animated using only the input received from the HoloLens he's wearing. To be very clear here, there are no external cameras used for any sort of tracking. We're only using the head and hand tracking signals from the HoloLens worn by Tom to infer the pose and motion of his upper body, and we use this to drive an avatar which moves just like him.
So here you can see Tom performing some, you know, typical HoloLens hand tracking motion, and what you can see now are the actual input signals that go into the system: the head pose and the hand pose in this case. You can sometimes see the hand poses jump in and out as the hands go in and out of the HoloLens's field of view, but still we're able to recover most of Tom's pose pretty faithfully.
So how does this work? Clearly, one problem that we're dealing with is very limited data. As you can see in the video on the left, we receive just head pose, hand pose and fingertip locations from the HoloLens. The problem of getting a human body pose from just this very sparse signal is incredibly under-constrained, so how can we make this tractable? We did this by taking advantage of SMPL, which acts as a strong prior for the under-constrained problem we have at hand. So now we can do something quite typical: recovering SMPL's pose by minimizing an energy, including a data term to make SMPL's head and hands line up with the signals, and a pose prior to encourage SMPL's pose to be likely.
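To make that kind of energy concrete, here is a minimal sketch of sparse-signal SMPL fitting, assuming the `smplx` Python package, a placeholder model path, and a zero-mean Gaussian pose prior standing in for a learned one; it is an illustration, not the production fitter.

```python
# Minimal sketch of sparse-signal SMPL fitting (not Microsoft's fitter).
# Assumes the `smplx` package; joint indices follow the 24-joint SMPL
# skeleton (15 = head, 20/21 = left/right wrist).
import torch
import smplx

model = smplx.create("models", model_type="smpl")        # model path is a placeholder
body_pose = torch.zeros(1, 69, requires_grad=True)       # axis-angle body pose
global_orient = torch.zeros(1, 3, requires_grad=True)
transl = torch.zeros(1, 3, requires_grad=True)

HEAD, L_WRIST, R_WRIST = 15, 20, 21
optimizer = torch.optim.Adam([body_pose, global_orient, transl], lr=0.05)

def fit(head_target, lwrist_target, rwrist_target, iters=100, w_prior=1e-3):
    """Fit SMPL so its head/hand joints line up with the tracked 3D positions."""
    for _ in range(iters):
        optimizer.zero_grad()
        out = model(body_pose=body_pose, global_orient=global_orient, transl=transl)
        joints = out.joints[0]                            # (J, 3) joint positions
        # Data term: head and hands should match the HoloLens signals.
        e_data = ((joints[HEAD] - head_target) ** 2).sum() \
               + ((joints[L_WRIST] - lwrist_target) ** 2).sum() \
               + ((joints[R_WRIST] - rwrist_target) ** 2).sum()
        # Pose prior: keep the body pose close to the mean (likely) pose.
        e_prior = w_prior * (body_pose ** 2).sum()
        (e_data + e_prior).backward()
        optimizer.step()
```

In practice a learned pose prior and a much faster solver would take the place of the simple Gaussian prior and the Adam loop shown here.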
However, it's not good enough to be able to fit SMPL to one person's signals at a time. We want to enable telepresence, to let people communicate and collaborate in mixed reality experiences, and for that we need to be able to run fast. This is firstly so that we can run on quite low-spec cloud machines, to run as cheaply as possible and make a cloud service viable: no monster GPU machines allowed here. And secondly, it's because we need to be able to support lots of users connecting to a single session at once. Fortunately, we already have plenty of in-house experience in doing body model fitting in real time. For example, the Azure Kinect skeletal tracker includes an excellent, very efficient model fitter that does rather similar stuff. So we adapted it for SMPL and introduced a number of optimizations in order to build a service which runs efficiently and reliably in the cloud.
So let's take a look at the result. Here is what the motion model looks like. Here you can see two people who are hundreds of kilometres apart and still interacting like they were in the same room. Each one is just using a HoloLens, no additional cameras. Their head and hand poses are being streamed to a cloud service, which is fitting SMPL to both of them in real time. The upper video shows what the purple user sees through their HoloLens: they can engage with the blue user, see their movement, hear their voice, and even, you know, receive these virtual markers from them. And here's another example of a couple of people collaborating on a shared virtual object; in this case I think it's a training session, you know, explaining how to use some sort of equipment. When you try this sort of thing out, you can really see how these types of interaction make virtual meetings, which we all got so used to last year, a totally different experience. So that's how we use SMPL in Microsoft Mesh.
But before we go into the next part of this talk, I'd like us to rewind a bit. The avatars we just saw were being driven using hand tracking, but where did these hand tracking poses come from? Here's a debugger view of a hand tracking algorithm running on HoloLens 2, and you may be surprised to find out that HoloLens 2 hand tracking was trained using synthetic data: training images made using computer graphics. You can see what that looks like up here. We had an awesome synthetic pipeline for this, but the problem for us was that our synthetic hand models ended at the forearm. So while the hand model looked very realistic close up and gave us great ground truth for machine learning, we couldn't make realistic full-frame images, and that's why we still had to use a lot of real data, with much simpler labels bear in mind, for training this network, which we called the hand detector. For us this was frustrating: we had this amazing synthetic pipeline for the hand only, which gave us, you know, great machine learning results, but no ability to make good synthetic data for this hand detector that ran on the full frame. And so we asked ourselves: what do we need to do to get to a place where we can have one hundred percent synthetic training data? We're in the SMPL tutorial, so I think you may have already guessed the answer. We licensed SMPL and began our journey of using SMPL for synthetic data.
And why use SMPL for synthetics? Well, you know, I believe parametric models are a great foundation for making synthetic data, and SMPL is the best-in-class parametric body model around. Large pose databases like AMASS exist and we can sample from these, and it has a traditional formulation which makes it quite amenable to graphics tooling. This means that we'll be able to find a way to turn these, you know, quite plastic-looking SMPLs into something much more realistic: realistic enough to be used as training data.
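Since sampling poses from AMASS comes up a few times here, this is roughly what it looks like in code: a small sketch assuming the SMPL+H AMASS release, whose `.npz` files store per-frame `poses`, `betas` and `trans` arrays (the file name below is a placeholder).

```python
# Sketch of sampling body poses from one AMASS sequence (file name is a placeholder).
import numpy as np

data = np.load("amass_sequence.npz")
poses = data["poses"]          # (T, 156) axis-angle parameters per frame (SMPL+H layout)
betas = data["betas"]          # body shape coefficients of the captured subject
trans = data["trans"]          # (T, 3) root translation

global_orient = poses[:, :3]   # root orientation
body_pose = poses[:, 3:66]     # 21 body joints x 3, which feeds SMPL's body pose

# Subsample every 10th frame to get a diverse set of training poses.
sampled = body_pose[::10]
```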
So collections like AMASS are amazing and provide, you know, great pose diversity, which is super good for learning things like generic pose priors. But sometimes with synthetics you'll want some very special poses which you cannot find in existing datasets. For example, no one has yet captured a great set of poses that correspond to typical hand-tracking-style interaction for mixed reality devices.
Fortunately, we have our own motion capture studio where we can suit up and capture special poses as required, and that's what you can see here in the middle. This starts off as just optical marker data, so we have to run MoSh on the mocap data to turn these markers into SMPL poses, and that's what you can see on the right. The white dots are the optical markers, the blue dots are the simulated ones, and the blue mesh is the final SMPL result. We do this with our PyTorch implementation of MoSh, which fits shape and pose to the entire sequence simultaneously.
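For illustration, here is a heavily simplified MoSh-style fit, not the real MoSh: one shared shape plus a per-frame pose, optimized against the whole marker sequence at once. It assumes the `smplx` package and known marker-to-vertex correspondences (real MoSh also estimates marker placement and offsets from the body surface).

```python
# Simplified MoSh-style sequence fit: shared betas, per-frame pose.
import torch
import smplx

T = 200                                              # number of mocap frames (example)
model = smplx.create("models", model_type="smpl", batch_size=T)

betas = torch.zeros(1, 10, requires_grad=True)       # shared body shape
poses = torch.zeros(T, 69, requires_grad=True)       # per-frame body pose
orient = torch.zeros(T, 3, requires_grad=True)
trans = torch.zeros(T, 3, requires_grad=True)

marker_verts = torch.tensor([3027, 412, 6573])       # hypothetical vertex id per marker
optimizer = torch.optim.Adam([betas, poses, orient, trans], lr=0.03)

def fit_sequence(markers_3d, iters=300):
    """markers_3d: (T, M, 3) observed optical marker positions, M = len(marker_verts)."""
    for _ in range(iters):
        optimizer.zero_grad()
        out = model(betas=betas.expand(T, -1), body_pose=poses,
                    global_orient=orient, transl=trans)
        pred = out.vertices[:, marker_verts]          # (T, M, 3) simulated markers
        e_data = ((pred - markers_3d) ** 2).sum()     # markers should match the mocap
        e_smooth = ((poses[1:] - poses[:-1]) ** 2).sum()   # temporal smoothness
        e_shape = (betas ** 2).sum()                  # keep the shape plausible
        (e_data + 0.1 * e_smooth + 1e-3 * e_shape).backward()
        optimizer.step()
```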
So one really cool thing with SMPL is how its shape space acts as a data multiplier. You know, like you saw in the previous slide, we only have one person performing poses on the stage, but we can take that pose sequence, switch up the body shape completely, and then re-render it. This is why parametric models are so great for synthetics: they really empower you in making your data as diverse as possible.
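A sketch of that data-multiplier idea, assuming the `smplx` package and a pose sequence already fitted as above (the zero tensors below are placeholders for the real fitted values):

```python
# One captured pose sequence, re-rendered with many sampled body shapes.
import torch
import smplx

T = 200                                     # frames in the captured sequence
poses = torch.zeros(T, 69)                  # fitted body pose per frame (placeholder)
orient = torch.zeros(T, 3)                  # fitted root orientation (placeholder)
trans = torch.zeros(T, 3)                   # fitted root translation (placeholder)
model = smplx.create("models", model_type="smpl", batch_size=T)

for actor in range(20):                     # 20 synthetic "actors" from one capture
    betas = torch.randn(1, 10)              # sample a new body shape from the shape space
    out = model(betas=betas.expand(T, -1),
                body_pose=poses, global_orient=orient, transl=trans)
    meshes = out.vertices.detach()          # (T, 6890, 3) per-frame body meshes
    # ...hand `meshes` to the renderer with a clothing material, skin texture, lighting, etc.
```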
So now we have some suitable poses; time to make our SMPLs look realistic.
SMPL is an awesome model of the geometry of naked people, but we do need to add some skin, and this is where we start leaning into the visual effects toolbox and pull out all the usual tricks to make a low-poly mesh like SMPL look convincingly realistic. We start with these really nice, high-quality photogrammetry scans that we licensed from 3D Scan Store, and we fit SMPL to them, baking out textures for albedo, displacement and bump. Then we subdivide SMPL to get a lot of really high-resolution geometry and apply these materials, and the results can start to look good: you can see the knobbly bits of bone poking out around the wrists and, you know, this guy's six-pack.
We all want to run machine learning on people who aren't naked, so we need to dress SMPL. There are many methods out there now for making SMPL look clothed, and they generally rely on having some sort of clothing geometry on top of SMPL. We went with the visual effects approach, using Marvelous Designer, which is this piece of software you see here, to prepare the clothing. Clothes made in this way have really nice UV maps for material texturing and can be simulated to get really realistic draping and wrinkles. But what do we do when SMPL moves around? We can try to simulate the clothing over a pose sequence, but things get really computationally expensive and rather unstable, so we have to find another way.
And here's where we got inspired by the SMPL community. At ICCV 2019 we saw Tex2Shape, and it was really impressive how well SMPL could be made to look clothed using just displacement. Beforehand, we were rigging each clothing item up as a mesh, and things started to get a bit complicated, with many different clothing items all needing the entire set of SMPL blend shapes transferred. But with this displacement map approach things got a lot easier: we could set up clothing as a material only, meaning the only mesh in the scene is SMPL and all the clothing detail is applied in the shader. And the good news is that this can look pretty good, as you can see here. With enough high-resolution geometry, and when we bake the maps right, even tiny details like the buttons are visible in close-ups, and we can still use another bump texture to retain fabric detail.
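The core of that shader trick can be sketched on the CPU as displacing the subdivided body surface along its vertex normals by a value sampled from the clothing displacement texture. The function below is a hypothetical NumPy illustration of the idea, not the production shader, and all of its inputs are assumed to come from the existing pipeline.

```python
# Displacement-map clothing, sketched on the CPU: push each vertex of the
# subdivided SMPL surface along its normal by the sampled displacement value.
import numpy as np

def apply_clothing_displacement(verts, normals, uvs, disp_map, scale=0.02):
    """verts/normals: (V, 3); uvs: (V, 2) in [0, 1]; disp_map: (H, W) grayscale texture."""
    h, w = disp_map.shape
    # Sample the displacement texture per vertex (texture origin at top-left).
    px = np.clip((uvs[:, 0] * (w - 1)).astype(int), 0, w - 1)
    py = np.clip(((1.0 - uvs[:, 1]) * (h - 1)).astype(int), 0, h - 1)
    d = disp_map[py, px]
    d = (d - 0.5) * 2.0                       # map [0, 1] texture values to [-1, 1]
    return verts + normals * (d[:, None] * scale)
```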
Here you can see a selection from our digital wardrobe. Once we figured out that this displacement-map clothing workflow was viable, we partnered with another team in Redmond who really scaled the whole thing up. The approach is now to author different parts of an outfit separately, so tops, bottoms and shoes all have their own displacement map, and these can be kitbashed, or composited together, into a single coherent outfit, and these materials are really easy to apply to the SMPL mesh. We are still building this library up, but I think it's amazing how well different types of clothing can be represented with this technique.
So here's an example of what this can look like. On the left you can see clothed SMPL with mocap animations being replayed on top of it, and on the right you can see a simulated egocentric view: a view a bit like what you might see from a head-mounted device. I think you'll agree this looks pretty realistic.
Now, synthetic data is a bit pointless unless we actually do some machine learning, so let's finally render a bunch of synthetic training data out for hand tracking; in the bottom left you can see some examples. And well, as they say, the proof is in the pudding: using SMPL-based synthetics alone, we can train machine learning systems that have no problem generalizing to real data for these challenging tasks. On the top right you can see hand detection working quite robustly, with these colourful blobs, the confidence maps, being drawn to show the estimated locations of the hands, and in the bottom right you can see some hand keypoint estimation that's working pretty well. So just to re-emphasize: these neural networks were trained with SMPL-based synthetic data only; they never saw a single real image. These are some of the first results that convinced us that SMPL could be used to fully replace real data in a project like hand tracking, and if you know how hard it is to collect real data, this is a super exciting prospect.
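As a rough illustration of the kind of confidence-map training this implies, here is a sketch that regresses per-hand heatmaps from synthetic images. It is a toy network, not Microsoft's hand detector, and it assumes a data loader that yields rendered frames paired with Gaussian heatmaps centred on the ground-truth hand positions from the synthetic pipeline.

```python
# Toy confidence-map (heatmap) regression trained on synthetic images only.
import torch
import torch.nn as nn

class HandHeatmapNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 1),                 # one confidence map per hand
        )

    def forward(self, x):
        return self.net(x)                       # heatmaps at 1/4 input resolution

def train(loader, epochs=10):
    """loader yields (images, heatmaps); heatmaps must match the 1/4-resolution output."""
    model = HandHeatmapNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for images, heatmaps in loader:          # synthetic frames + rendered labels
            opt.zero_grad()
            loss = loss_fn(model(images), heatmaps)   # regress the confidence maps
            loss.backward()
            opt.step()
    return model
```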
Thank you so much for listening; it was really fun to talk to you today about how we use SMPL in mixed reality. Please do get in touch with us if you're interested in working with us at the forefront of AI for virtual presence, and please check out our sponsor session tomorrow, where myself and my colleague Tadas will be talking a little bit more about how we use synthetic data. Thanks very much.