SMPL in Mixed Reality at Microsoft
By Microsoft Research
Summary
## Key takeaways
- **Microsoft Mesh: Shared Presence in Mixed Reality**: Microsoft Mesh enables shared presence and experiences through mixed reality, allowing users to interact as if they are in the same room, regardless of physical distance. [00:17]
- **SMPL Drives Realistic Avatars in Mixed Reality**: The SMPL body model is used to create avatars that represent users as faithfully as possible in Microsoft Mesh, driven by HoloLens head and hand tracking data without external cameras. [03:08], [03:30]
- **Overcoming Sparse Data with SMPL Prior**: SMPL acts as a strong prior to solve the under-constrained problem of inferring full body pose from limited HoloLens head and hand tracking signals, making real-time avatar animation feasible. [04:21], [04:43]
- **Efficient Cloud Service for Real-Time Avatars**: Optimized SMPL fitting, adapted from efficient in-house trackers, enables a cloud service that runs cheaply on low-spec machines and supports numerous users simultaneously for telepresence. [05:15], [05:50]
- **Synthetic Data Generation for Hand Tracking**: Microsoft uses SMPL to generate realistic synthetic training data for hand tracking, overcoming limitations of previous pipelines that only modeled hands up to the forearm. [07:16], [08:14]
- **SMPL Shape Space as a Data Multiplier**: SMPL's shape space acts as a data multiplier, allowing a single captured pose sequence to be rendered with diverse body shapes, significantly increasing the variety of synthetic training data. [09:45]
Topics Covered
- Limited Data, Full Avatars: Overcoming Under-Constrained Pose Estimation
- Cloud-Powered Telepresence: Enabling Shared Mixed Reality Experiences
- Parametric Models: Foundation for Diverse Synthetic Data
- Realistic Avatars: From Naked Geometry to Clothed Digital Humans
- Synthetic Data: Training Robust AI Without Real Images
Full Transcript
Hi, I'm Errol, a scientist at Microsoft in Cambridge, UK, and I'm delighted to talk to you today a little bit about how we use SMPL in mixed reality projects, both in research and in products. To set the stage, I'd like to start by showing you a short video to introduce Microsoft Mesh, a new service we have that enables shared presence and shared experiences through mixed reality.
Connection is a spark that gives our lives meaning. [Music] It drives us to seek out others who feel the same way. "Okay, why don't you input the data and we'll take a look together." "Hey Mario, what you got for me?" ...to find those who share our views yet offer different perspectives. "Look over here." ...challenge us with new ways of seeing. "But I think we should..." ...deepen our understanding [Music] and enrich our lives. [Music] "So, for you..." Great things happen when we commit to something bigger than ourselves. "Take a closer look at it." "Place this here. Let's see how we go from there." "Okay." This sense of collaboration, and the feelings of connection it brings, excites us. "Hey, just in time." "I'm gonna move it slightly. Okay, it's yours, take it." "We have two planes right now on the same trajectory." As we put people first, technology fades into the background and feels like anything but. "Aisha, what do you think?" "I think if we had 330, maintaining 2,800, we'll be clear for approach." This changes the way we see the world, and in turn changes the world we see. "These numbers are looking great, actually." There's promise in the possibilities, and what we see and create next will stretch the imagination. "Good morning, Sarah." "Morning." "Slowly coming towards the thumb..." ...a world without... "A lot better than boundaries, yeah." "Excellent." ...slowly bringing the world where technology enhances, not limits, humanity. With people front, center and in the spotlight, the future is here, and here can be anywhere. Introducing Microsoft Mesh. [Music]
Wow, what a video. We covered a lot of different ways to share experiences in mixed reality. One of these ways is with avatars, these 3D models of people. Our goal is that these avatars represent the user as faithfully as possible, and for that we use a motion model which uses SMPL. Now, that video was awesome, but it also included quite a bit of smoke and mirrors, so what does the real thing look like? Let's look at what's going on under the hood. On the left you can see one of my colleagues, Tom, and on the right you can see his avatar, a SMPL body which is animated using only the input received from the HoloLens he's wearing. To be very clear here, there are no external cameras used for any sort of tracking. We're only using the head and hand tracking signals from the HoloLens worn by Tom to infer the pose and motion of his upper body, and we use this to drive an avatar which moves just like him.
So here you can see Tom performing some, you know, typical HoloLens hand tracking motion, and what you can see now are the actual input signals that go into the system: the head pose and the hand pose in this case. You can sometimes see the hand poses jump in and out as the hands go in and out of the HoloLens's field of view, but still we're able to recover most of Tom's pose pretty faithfully.
So how does this work? Clearly, one problem that we're dealing with is very limited data. As you can see in the video on the left, we receive just head pose, hand pose and fingertip locations from the HoloLens. The problem of getting a human body pose from just this very sparse signal is incredibly under-constrained, so how can we make this tractable? We did this by taking advantage of SMPL, which acts as a strong prior for the under-constrained problem we have at hand. So now we can do something quite typical: recovering SMPL's pose by minimizing an energy, including a data term to make SMPL's head and hands line up with the signals, and a pose prior to encourage SMPL's pose to be likely.
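To make that kind of energy concrete, here is a minimal sketch of sparse-signal SMPL fitting, assuming the `smplx` Python package, a placeholder model path, and a zero-mean Gaussian pose prior standing in for a learned one; it is an illustration, not the production fitter.

```python
# Minimal sketch of sparse-signal SMPL fitting (not Microsoft's fitter).
# Assumes the `smplx` package; joint indices follow the 24-joint SMPL
# skeleton (15 = head, 20/21 = left/right wrist).
import torch
import smplx

model = smplx.create("models", model_type="smpl")        # model path is a placeholder
body_pose = torch.zeros(1, 69, requires_grad=True)       # axis-angle body pose
global_orient = torch.zeros(1, 3, requires_grad=True)
transl = torch.zeros(1, 3, requires_grad=True)

HEAD, L_WRIST, R_WRIST = 15, 20, 21
optimizer = torch.optim.Adam([body_pose, global_orient, transl], lr=0.05)

def fit(head_target, lwrist_target, rwrist_target, iters=100, w_prior=1e-3):
    """Fit SMPL so its head/hand joints line up with the tracked 3D positions."""
    for _ in range(iters):
        optimizer.zero_grad()
        out = model(body_pose=body_pose, global_orient=global_orient, transl=transl)
        joints = out.joints[0]                            # (J, 3) joint positions
        # Data term: head and hands should match the HoloLens signals.
        e_data = ((joints[HEAD] - head_target) ** 2).sum() \
               + ((joints[L_WRIST] - lwrist_target) ** 2).sum() \
               + ((joints[R_WRIST] - rwrist_target) ** 2).sum()
        # Pose prior: keep the body pose close to the mean (likely) pose.
        e_prior = w_prior * (body_pose ** 2).sum()
        (e_data + e_prior).backward()
        optimizer.step()
```

In practice a learned pose prior and a much faster solver would take the place of the simple Gaussian prior and the Adam loop shown here.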
However, it's not good enough to be able to fit SMPL to one person's signals at a time. We want to enable telepresence, to let people communicate and collaborate in mixed reality experiences, and for that we need to be able to run fast. This is firstly so that we can run on quite low-spec cloud machines, to run as cheaply as possible and make a cloud service viable: no monster GPU machines allowed here. And secondly, it's because we need to be able to support lots of users connecting to a single session at once. Fortunately, we already have plenty of in-house experience in doing body model fitting in real time. For example, the Azure Kinect skeletal tracker includes an excellent, very efficient model fitter that does rather similar stuff. So we adapted it for SMPL and introduced a number of optimizations in order to build a service which runs efficiently and reliably in the cloud.
So let's take a look at the result. Here is what the motion model looks like. Here you can see two people who are hundreds of kilometres apart and still interacting like they were in the same room. Each one is just using a HoloLens, no additional cameras. Their head and hand poses are being streamed to a cloud service, which is fitting SMPL to both of them in real time. The upper video shows what the purple user sees through their HoloLens: they can engage with the blue user, see their movement, hear their voice, and even, you know, receive these virtual markers from them. And here's another example of a couple of people collaborating on a shared virtual object; in this case I think it's a training session, you know, explaining how to use some sort of equipment. When you try this sort of thing out, you can really see how these types of interaction make virtual meetings, which we all got so used to last year, a totally different experience. So that's how we use SMPL in Microsoft Mesh.
But before we go into the next part of this talk, I'd like us to rewind a bit. The avatars we just saw were being driven using hand tracking, but where did these hand tracking poses come from? Here's a debugger view of a hand tracking algorithm running on HoloLens 2, and you may be surprised to find out that HoloLens 2 hand tracking was trained using synthetic data: training images made using computer graphics. You can see what that looks like up here. We had an awesome synthetic pipeline for this, but the problem for us was that our synthetic hand models ended at the forearm. So while the hand model looked very realistic close up and gave us great ground truth for machine learning, we couldn't make realistic full-frame images, and that's why we still had to use a lot of real data, with much simpler labels bear in mind, for training this network, which we called the hand detector. For us this was frustrating: we had this amazing synthetic pipeline for the hand only, which gave us, you know, great machine learning results, but no ability to make good synthetic data for this hand detector that ran on the full frame. And so we asked ourselves: what do we need to do to get to a place where we can have one hundred percent synthetic training data? We're in the SMPL tutorial, so I think you may have already guessed the answer. We licensed SMPL and began our journey of using SMPL for synthetic data.
And why use SMPL for synthetics? Well, you know, I believe parametric models are a great foundation for making synthetic data, and SMPL is the best-in-class parametric body model around. Large pose databases like AMASS exist and we can sample from these, and it has a traditional formulation which makes it quite amenable to graphics tooling. This means that we'll be able to find a way to turn these, you know, quite plastic-looking SMPLs into something much more realistic: realistic enough to be used as training data.
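Since sampling poses from AMASS comes up a few times here, this is roughly what it looks like in code: a small sketch assuming the SMPL+H AMASS release, whose `.npz` files store per-frame `poses`, `betas` and `trans` arrays (the file name below is a placeholder).

```python
# Sketch of sampling body poses from one AMASS sequence (file name is a placeholder).
import numpy as np

data = np.load("amass_sequence.npz")
poses = data["poses"]          # (T, 156) axis-angle parameters per frame (SMPL+H layout)
betas = data["betas"]          # body shape coefficients of the captured subject
trans = data["trans"]          # (T, 3) root translation

global_orient = poses[:, :3]   # root orientation
body_pose = poses[:, 3:66]     # 21 body joints x 3, which feeds SMPL's body pose

# Subsample every 10th frame to get a diverse set of training poses.
sampled = body_pose[::10]
```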
So collections like AMASS are amazing and provide, you know, great pose diversity, which is super good for learning things like generic pose priors. But sometimes with synthetics you'll want some very special poses which you cannot find in existing datasets. For example, no one has yet captured a great set of poses that correspond to typical hand-tracking-style interaction for mixed reality devices.
Fortunately, we have our own motion capture studio where we can suit up and capture special poses as required, and that's what you can see here in the middle. This starts off as just optical marker data, so we have to run MoSh on the mocap data to turn these markers into SMPL poses, and that's what you can see on the right. The white dots are the optical markers, the blue dots are the simulated ones, and the blue mesh is the final SMPL result. We do this with our PyTorch implementation of MoSh, which fits shape and pose to the entire sequence simultaneously.
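For illustration, here is a heavily simplified MoSh-style fit, not the real MoSh: one shared shape plus a per-frame pose, optimized against the whole marker sequence at once. It assumes the `smplx` package and known marker-to-vertex correspondences (real MoSh also estimates marker placement and offsets from the body surface).

```python
# Simplified MoSh-style sequence fit: shared betas, per-frame pose.
import torch
import smplx

T = 200                                              # number of mocap frames (example)
model = smplx.create("models", model_type="smpl", batch_size=T)

betas = torch.zeros(1, 10, requires_grad=True)       # shared body shape
poses = torch.zeros(T, 69, requires_grad=True)       # per-frame body pose
orient = torch.zeros(T, 3, requires_grad=True)
trans = torch.zeros(T, 3, requires_grad=True)

marker_verts = torch.tensor([3027, 412, 6573])       # hypothetical vertex id per marker
optimizer = torch.optim.Adam([betas, poses, orient, trans], lr=0.03)

def fit_sequence(markers_3d, iters=300):
    """markers_3d: (T, M, 3) observed optical marker positions, M = len(marker_verts)."""
    for _ in range(iters):
        optimizer.zero_grad()
        out = model(betas=betas.expand(T, -1), body_pose=poses,
                    global_orient=orient, transl=trans)
        pred = out.vertices[:, marker_verts]          # (T, M, 3) simulated markers
        e_data = ((pred - markers_3d) ** 2).sum()     # markers should match the mocap
        e_smooth = ((poses[1:] - poses[:-1]) ** 2).sum()   # temporal smoothness
        e_shape = (betas ** 2).sum()                  # keep the shape plausible
        (e_data + 0.1 * e_smooth + 1e-3 * e_shape).backward()
        optimizer.step()
```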
So one really cool thing with SMPL is how its shape space acts as a data multiplier. You know, like you saw in the previous slide, we only have one person performing poses on the stage, but we can take that pose sequence, switch up the body shape completely, and then re-render it. This is why parametric models are so great for synthetics: they really empower you in making your data as diverse as possible.
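A sketch of that data-multiplier idea, assuming the `smplx` package and a pose sequence already fitted as above (the zero tensors below are placeholders for the real fitted values):

```python
# One captured pose sequence, re-rendered with many sampled body shapes.
import torch
import smplx

T = 200                                     # frames in the captured sequence
poses = torch.zeros(T, 69)                  # fitted body pose per frame (placeholder)
orient = torch.zeros(T, 3)                  # fitted root orientation (placeholder)
trans = torch.zeros(T, 3)                   # fitted root translation (placeholder)
model = smplx.create("models", model_type="smpl", batch_size=T)

for actor in range(20):                     # 20 synthetic "actors" from one capture
    betas = torch.randn(1, 10)              # sample a new body shape from the shape space
    out = model(betas=betas.expand(T, -1),
                body_pose=poses, global_orient=orient, transl=trans)
    meshes = out.vertices.detach()          # (T, 6890, 3) per-frame body meshes
    # ...hand `meshes` to the renderer with a clothing material, skin texture, lighting, etc.
```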
So now we have some suitable poses; time to make our SMPLs look realistic.
SMPL is an awesome model of the geometry of naked people, but we do need to add some skin, and this is where we start leaning into the visual effects toolbox and pull out all the usual tricks to make a low-poly mesh like SMPL look convincingly realistic. We start with these really nice, high-quality photogrammetry scans that we licensed from 3D Scan Store, and we fit SMPL to them, baking out textures for albedo, displacement and bump. Then we subdivide SMPL to get a lot of really high-resolution geometry and apply these materials, and the results can start to look good: you can see the knobbly bits of bone poking out around the wrists and, you know, this guy's six-pack.
We all want to run machine learning on people who aren't naked, so we need to dress SMPL. There are many methods out there now for making SMPL look clothed, and they generally rely on having some sort of clothing geometry on top of SMPL. We went with the visual effects approach, using Marvelous Designer, which is this piece of software you see here, to prepare the clothing. Clothes made in this way have really nice UV maps for material texturing and can be simulated to get really realistic draping and wrinkles. But what do we do when SMPL moves around? We can try to simulate the clothing over a pose sequence, but things get really computationally expensive and rather unstable, so we have to find another way.
And here's where we got inspired by the SMPL community. At ICCV 2019 we saw Tex2Shape, and it was really impressive how well SMPL could be made to look clothed using just displacement. Beforehand, we were rigging each clothing item up as a mesh, and things started to get a bit complicated, with many different clothing items all needing the entire set of SMPL blend shapes transferred. But with this displacement map approach things got a lot easier: we could set up clothing as a material only, meaning the only mesh in the scene is SMPL and all the clothing detail is applied in the shader. And the good news is that this can look pretty good, as you can see here. With enough high-resolution geometry, and when we bake the maps right, even tiny details like the buttons are visible in close-ups, and we can still use another bump texture to retain fabric detail.
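The core of that shader trick can be sketched on the CPU as displacing the subdivided body surface along its vertex normals by a value sampled from the clothing displacement texture. The function below is a hypothetical NumPy illustration of the idea, not the production shader, and all of its inputs are assumed to come from the existing pipeline.

```python
# Displacement-map clothing, sketched on the CPU: push each vertex of the
# subdivided SMPL surface along its normal by the sampled displacement value.
import numpy as np

def apply_clothing_displacement(verts, normals, uvs, disp_map, scale=0.02):
    """verts/normals: (V, 3); uvs: (V, 2) in [0, 1]; disp_map: (H, W) grayscale texture."""
    h, w = disp_map.shape
    # Sample the displacement texture per vertex (texture origin at top-left).
    px = np.clip((uvs[:, 0] * (w - 1)).astype(int), 0, w - 1)
    py = np.clip(((1.0 - uvs[:, 1]) * (h - 1)).astype(int), 0, h - 1)
    d = disp_map[py, px]
    d = (d - 0.5) * 2.0                       # map [0, 1] texture values to [-1, 1]
    return verts + normals * (d[:, None] * scale)
```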
Here you can see a selection from our digital wardrobe. Once we figured out that this displacement-map clothing workflow was viable, we partnered with another team in Redmond who really scaled the whole thing up. The approach is now to author different parts of an outfit separately, so tops, bottoms and shoes all have their own displacement map, and these can be kitbashed, or composited together, into a single coherent outfit, and these materials are really easy to apply to the SMPL mesh. We are still building this library up, but I think it's amazing how well different types of clothing can be represented with this technique.
So here's an example of what this can look like. On the left you can see clothed SMPL with mocap animations being replayed on top of it, and on the right you can see a simulated egocentric view: a view a bit like what you might see from a head-mounted device. I think you'll agree this looks pretty realistic.
Now, synthetic data is a bit pointless unless we actually do some machine learning, so let's finally render a bunch of synthetic training data out for hand tracking; in the bottom left you can see some examples. And well, as they say, the proof is in the pudding: using SMPL-based synthetics alone, we can train machine learning systems that have no problem generalizing to real data for these challenging tasks. On the top right you can see hand detection working quite robustly, with these colourful blobs, the confidence maps, being drawn to show the estimated locations of the hands, and in the bottom right you can see some hand keypoint estimation that's working pretty well. So just to re-emphasize: these neural networks were trained with SMPL-based synthetic data only; they never saw a single real image. These are some of the first results that convinced us that SMPL could be used to fully replace real data in a project like hand tracking, and if you know how hard it is to collect real data, this is a super exciting prospect.
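As a rough illustration of the kind of confidence-map training this implies, here is a sketch that regresses per-hand heatmaps from synthetic images. It is a toy network, not Microsoft's hand detector, and it assumes a data loader that yields rendered frames paired with Gaussian heatmaps centred on the ground-truth hand positions from the synthetic pipeline.

```python
# Toy confidence-map (heatmap) regression trained on synthetic images only.
import torch
import torch.nn as nn

class HandHeatmapNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 1),                 # one confidence map per hand
        )

    def forward(self, x):
        return self.net(x)                       # heatmaps at 1/4 input resolution

def train(loader, epochs=10):
    """loader yields (images, heatmaps); heatmaps must match the 1/4-resolution output."""
    model = HandHeatmapNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for images, heatmaps in loader:          # synthetic frames + rendered labels
            opt.zero_grad()
            loss = loss_fn(model(images), heatmaps)   # regress the confidence maps
            loss.backward()
            opt.step()
    return model
```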
Thank you so much for listening; it was really fun to talk to you today about how we use SMPL in mixed reality. Please do get in touch with us if you're interested in working with us at the forefront of AI for virtual presence, and please check out our sponsor session tomorrow, where myself and my colleague Tadas will be talking a little bit more about how we use synthetic data. Thanks very much.