Navigating the Service Mesh Ecosystem - George Miranda, Buoyant, Inc. & Diogenes Rittori, Pivotal
By CNCF [Cloud Native Computing Foundation]
Summary
## Key takeaways
- **Service Mesh: An Application-Focused Network**: A service mesh functions as an application-focused network, with its features designed to benefit the application rather than the network itself. This is particularly relevant when breaking down monoliths into microservices, entering the realm of distributed systems. [03:12], [03:51]
- **Beware the Fallacies of Distributed Systems**: Ignoring the fallacies of distributed systems, such as unreliable networks, variable latency, and changing topology, can lead to significant problems in application development. These fallacies, identified over fifteen years ago, remain critical considerations. [04:34], [05:53]
- **Service Mesh Adoption: Need vs. Fashion**: Adopting a service mesh should be driven by a genuine need to solve specific problems, not by industry trends. If you can easily name all your services or deploy code infrequently, a service mesh might be unnecessary complexity. [08:10], [09:03]
- **Linkerd v1: Multi-Platform, Heavyweight**: Linkerd v1, the first tool to use the 'service mesh' term, is a battle-tested, multi-platform solution. However, its Scala-based, JVM-dependent architecture results in a significant footprint, making it a heavyweight option. [12:06], [13:54]
- **Envoy: C++, VM-Centric Origins**: Envoy, originally written in C++ for virtual machines, was created to solve integration problems for companies like Lyft. While powerful, its C++ requirement for extensions and its VM-centric origins are key differentiators. [15:10], [16:10]
- **Istio: Feature-Rich but Complex**: Istio offers extensive features and significant engineering effort from companies like Google and IBM. However, its complexity, with over 50 custom resource definitions, means its adoption requires careful consideration of the added management overhead. [16:54], [17:51]
Topics Covered
- Service Mesh: An Application-Focused Network
- The Fallacies of Distributed Systems Still Apply
- Is Service Mesh Really For You? Ask These Questions First
- Are You Ready for A/B Testing with Service Mesh?
- Service Mesh vs. Network Policies: A Compatibility Challenge
Full Transcript
hello welcome to navigating the service
mesh ecosystem before we get started I
want to introduce the talk a little bit
so we'll get into introductions my name
is George this is Diogenes we are
currently not working for companies that
make products that are in this
presentation but we have both been very
involved in the service mesh ecosystem
either by making contributions or giving
talks writing books or articles about
the service mesh service mesh technology
how to approach it and as a former
engineer one of the things that we've
seen is a lot of interest in the service
mesh and more products being introduced
and the enthusiasm is really good
there's a lot of uptake of service mesh
technology but with all the enthusiasm
comes a lot of questions and a lot of
confusion around which service mesh you
should use and most of the time you see
a presentation from a vendor that is
talking to you about their service mesh
and why that's the right one to use so
we thought we would do something a
little different and give a talk that is
vendor neutral that looks at the
different projects in the ecosystem and
tries to give you as engineers the right
questions to ask to figure out which
solution is right for you there's no one
Universal answer to that and hopefully
this presentation will help you figure
out which approach is right for you
so with that we'll do introductions my
name is George I used to work at Buoyant
they're the makers of the Linkerd
service mesh I was director of community
there I'm now at a company called
PagerDuty
I still do community there all right
PagerDuty users and so I do a lot
of things like this I work with a lot of
users to help you figure out you know
how to make infrastructure management
better and easier to approach and so
hopefully this talk does a little bit
more of that
thanks thanks George so my name is Diogenes
Rittori as you can see I'm not the same
person from the picture I have a little
bit more hair than when that
picture was taken that's that's just
getting to me
so I've also been involved in the service mesh
space for a while I was able to give
presentations write articles and while
at Red Hat I was a product manager on
the service mesh capability of OpenShift
if you've heard of that so very
excited to be here talking to you I also
want to thank our sponsors the companies
that employ us and pay us the tickets to
get here so thank you so and I'm sure
you're very happy to have PagerDuty
sponsoring you as well here right and
you've seen introductions to service
mesh many times and I think you've seen
like what is the service mesh and George
said like we want to give you a
different perspective on this you know
so we will talk about technical features
of what makes a service mesh but from a
different perspective so this is the
way I explain what a service mesh is
which is an application focused network
right I think the service mesh on
purpose does not have the name network
in it because then you'd have to deal
with network folks and you know it's
always hard right
so it's an application focused network
and the features that the service mesh
delivers exist to benefit the
application right so they do not exist
necessarily for the benefit of the
network itself but the capabilities that
a service mesh delivers exist
actually to benefit the application so
I'm going to invest some more time in
addressing the needs of
distributed systems and that's the
important point because once you are
dealing with applications that need to
communicate and once you are doing a
process of breaking a monolith
into individual microservices or into
individual smaller applications if you
are doing that you are in the realm or in
the space of distributed systems right
so I think and I strongly believe that
failing to acknowledge that you are in
this space of distributed systems will
bring you a lot of problems right so and
to present you some of the problems I
love talking about the fallacies of
distributed systems so these fallacies
which means they are all false right were
created more than fifteen years ago when
people were starting to develop
distributed systems and they're still
very valid today so the first one is
that the network is reliable but
then I was thinking that myself as a
software engineer I would always write
code assuming that the network was
reliable that when I'm going to ask my
application to connect or to integrate
with another application I assume that
there will be a reliable network there
and that's not true you know networks
are not reliable networks will go down
you will have a problem at some point in
time and the same can be said about
latency you know so there is latency in
the network it's you don't have infinite
ability to transmit and receive data you
don't have infinite ability to process
data so when you develop applications
and you forget to acknowledge the
fallacies of distributed systems you
will be in trouble right so some of the
others one that I especially find
interesting is that the topology
doesn't change you know so you are
writing an application assuming that
you're going to connect to a certain
database and you're referring to the
database with the IP of the database but
in the development phase it's one IP and
in production it's a different IP so
just in that way the
topology has changed the topology from
when you were developing the application
is different from the topology from
when you're running the application
right so again acknowledging that the
topology changes is very important you
might think about service discovery then
you might think about externalized
configuration so again it's important
when you're developing applications to
recognize this right
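
The point about a database IP that differs between development and production can be illustrated with externalized configuration. The following is only a minimal sketch: the variable names `DB_HOST` and `DB_PORT` and the defaults are hypothetical and are not taken from the talk or from any particular mesh.

```python
import os


def database_address(env=None):
    """Return (host, port) for the database from externalized configuration.

    The address is looked up at run time instead of being hard-coded,
    so the same code works when the topology changes between environments.
    DB_HOST and DB_PORT are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("DB_HOST", "localhost")   # assumed development default
    port = int(env.get("DB_PORT", "5432"))   # assumed default port
    return host, port


if __name__ == "__main__":
    # Development: nothing set, fall back to the local defaults.
    print(database_address({}))
    # Production: the platform injects a stable name, not a fixed IP.
    print(database_address({"DB_HOST": "orders-db.prod.internal", "DB_PORT": "6432"}))
```

Passing a stable name rather than an IP leaves the actual lookup to whatever service discovery the platform provides.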
also the one about transport cost being zero
very very important and the service
mesh technologies and the relationship
that exists here it's not let's say a
hundred percent precise you can't
say that they are a hundred
percent one-to-one the relationships
that exist here but I think there is
some sense to the relationships
between the fallacies of distributed
computing and the capabilities that a
service mesh delivers right so just
take for example the network is reliable
if I cannot trust the network then I
will protect myself from that I as the
engineer of the application I will
protect and how do I do that I will
think about adding maybe circuit breaking
into my application automated retries
some sort of load balancing so that the
application can try to come to a
trustworthy scenario
without having to do much work right
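
To make the circuit breaking and automated retries mentioned above concrete, here is a minimal sketch of what an application would otherwise have to carry itself. The class and function names, the thresholds, and the backoff schedule are all assumptions made for illustration; this is not how any specific service mesh implements these features.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    """Open after max_failures consecutive errors, retry after reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff, but never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would combine the two as `call_with_retries(lambda: breaker.call(fetch_profile))`, where `fetch_profile` stands for any hypothetical network call. A mesh moves this kind of logic out of the application and into the proxy.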
again network is secure we all know the
network is not secure so why don't you
think about making sure that the
communication channels between the
applications are encrypted so and that
leads us to this very important point
which is that service mesh is not for
you right we think and you can probably
go watch another session right there
so you just leave the room right now and
go watch another session right and and
the reason is that I think we fail to
do the necessary engineering and architecture
work of making sense of things so we
fail to really do a proper analysis to
see if we really need to use technology
right maybe this is a fashion industry
but I don't think it should be a fashion
industry I think there should be a need
to use technologies and one example is
can you name all your
services well if you probably know all
the services that you are talking to
that means that the number of services
is not that big right
you know like 20 now you can name 20
services and that's all that your
application is going to interface with
hmm maybe service mesh is not for you
right maybe even kubernetes is not for
you right if your space is not so big
another one do you know how many times
you deploy code per month oh we do we
do twice a month ok do you think it's
worth the burden extra complexity extra
managerial needs of a service mesh if
you do deployments twice a month it's
important as architects and engineers to
make that consideration right because
again if not we're just in a fashion
industry and I think building on that
right a good question to ask also when
figuring out if the service mesh is
right for you
is when something goes wrong when
there's a failure do you know which
services failed and why is it very clear
when something goes wrong where that
failure occurred or do you have so many
services that the interdependencies are
very difficult to determine if that's
the case right if it's more complex when
something goes wrong you're not really
sure where the problem is
then maybe you actually have a need for
this but if the answer to this is yes
something goes wrong and I know exactly
what failed then maybe the service mesh
is not for you and then the most
important question of all right do you
want to use the service mesh because
it's cool because it's great technology
and you want to tell your friends I am
using a service mesh because it's great
if the answer to that is yes then the
service mesh is probably not for you and
so what we're getting to you right is
that there's some very important work
you need to do here to ask what problems
you are having why do you need a
solution like the service mesh do you
have a lot of complexity
are you in that distributed systems
world right are you running up against
problems like having more than one
administrator having a network that is
not homogeneous right if you are having
those issues then maybe the service
mesh is for you and they stayed so that's
a stage right so you didn't go watch
another session so so let's say that
we've done that work right and we don't
know how many times we've deployed in
the last month right we can't name all
of our services we know that we have a
very complex distributed application
that we are managing well then maybe you
actually have a real problem and then
maybe at that point a service mesh makes
sense for you to use but now right you
know that you need a service mesh it's a
fundamental building block in a cloud
native stack which one do we use and so
that's the point of this talk right the
idea is that we're going to look at the
different options that are available in
the cloud native ecosystem and there are
a number of other service mesh options
we'll talk about those a little bit
later but today we're going to focus on
the products that fall into the service
mesh category that are popular when
you're using kubernetes or in the cloud
native ecosystem we're going to go
through this list historically in order
of what project came along when so we're
gonna start with Linkerd and Linkerd
has two versions right we heard Liz Rice
on stage talk about version two we're
gonna look at Linkerd version one and
the way to think about Linkerd v1
versus v2 is a little bit like Apache
right there's Apache version one and
Apache version two and they
solve different problems they
are different bits of software it's not
just an upgrade from one to the other so
Linkerd v1 has been around for almost
three years
as of February it's tried and true right
we've it's more than a trillion requests
served in production it's battle tested
and used but the way to think about
Linkerd is because it was written three
years ago it was written in a time
before kubernetes was the de facto
standard for container management
platforms right it's back when DC/OS was
still a thing
Docker Swarm was a thing right we thought
like that might work out and so Linkerd
is meant to be multi-platform so the way
to think about Linkerd is if you have
services that you are trying to manage
outside of just kubernetes then this
solution might be for you and when you
look at Linkerd right Linkerd was the
first tool to use the term service mesh
for this category of tool and for this
category of tool right the things that
you would expect are there you know
resiliency features latency aware load
balancing circuit breaking retries
automatic TLS a very deep language for
specifying how you do per route
configurations all of those built-ins
are there but here's how to think about
Linkerd version one it's a very
powerful solution but it's also a very
heavyweight solution and so it's written
in Scala right which means it needs the
JVM to run so the footprint of that can
be pretty big there's been some recent
work to get it working on GraalVM and
make that footprint a lot smaller but
it's still significant so the reasons
that you might go this route are again
if you have a multi-platform use case or
you need that type of complexity you
need all of those heavyweight features
then Linkerd might be for you and might
be a place to start and now you go to
Envoy right I mean we've seen some
talks on Envoy already Envoy just yesterday or
today was accepted to be the third
project to move out of incubation in CNCF
so I think they'll probably make an
announcement soon now that Envoy got
the amount of votes needed to become a
graduated project and what I especially
like about Envoy is that it was
not a solution looking for a problem it
was someone Matt Klein at Lyft that
had a problem on how to integrate
and connect multiple services in a
very performant way and he decided to
create something right so it was created
very much for the use case of Lyft which
is a ride-sharing company like DiDi
here in China and it's and we go
into the very purpose of this
presentation it's it's important to know
the differences right so for example as
let's say opposed to or different than
Linkerd v1 Envoy was written in C++
right so that means that those that want
or need to extend Envoy for
whatever reason they need to be able to
do that in C++ and I think the reality
is that there are probably more Java engineers
today than there are C++ engineers so
all those considerations have to be
taken into account when you make the
decision of choosing a technology right
also interesting point about Envoy is
that it was created for a world without
containers as well it was created you
know for a world of virtual machines so
the deployment pattern was that each
virtual machine will have one service
and together in the virtual machine the
proxy will sit there so it's like
similar to containers but instead of
like one container per application it
will be like one virtual machine per
application again the project has gained
a lot of popularity it's a it's a great
technology and it was extended it was
extended in many different ways one of
the ways that the project was expanded
is with Istio so the engineers worked
together to create extensible interfaces
in Envoy that allow you to publish new
rules to Envoy in a way that for anyone
that wants to do that it's
facilitated again important point is
that Istio is not necessarily and
technically a CNCF project
but I think it's fair to introduce
it here right so Istio has a very
very strong development from from Google
from IBM there are other companies such
as Red Hat such as Pivotal they also
participate in the development but it's
it's very interesting to see how much
engineering effort has been put by by
IBM and Google on this right so IBM was
doing a
similar project I think called
Amalgam8 and they decided to unite
those projects
and make them into a single thing and
the table that I showed earlier pretty
much talks about some of the Istio
features and how they map to
distributed computing
right so Istio has been evolving I
think it's fair to say that
Istio is becoming a little bit
complicated if you're just starting now
if you know kubernetes
a little bit there are more
than 50 different custom resource
definitions that you can use inside
Istio to configure it to your liking and
again the point right the extra
complexity when dealing with service
mesh has to be worth it right so if
you're really willing to use a
technology that is going to require more
knowledge to manage and probably
different people to manage you have to
make that decision you can't just use
without thinking about that but again
Istio is great so it does a great amount
of work in generating certificates in
rotating and distributing those
certificates for you I mean just
thinking about the distribution of
certificates in a scenario without this
where you would have to keep doing the
manual rotation so that's
always a problem
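
As a rough illustration of the bookkeeping that manual certificate rotation involves, the sketch below decides when a certificate is due for renewal based on how much of its lifetime has elapsed. The function name and the 80 percent threshold are assumptions made for this example and do not describe what Istio actually does.

```python
from datetime import datetime, timedelta, timezone


def needs_rotation(not_before, not_after, now=None, threshold=0.8):
    """Return True once `threshold` of the certificate lifetime has passed.

    not_before / not_after are the validity bounds of the certificate.
    The 0.8 default is an arbitrary value chosen for this sketch.
    """
    now = datetime.now(timezone.utc) if now is None else now
    lifetime = not_after - not_before
    return now >= not_before + lifetime * threshold


if __name__ == "__main__":
    issued = datetime(2018, 11, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=90)
    # Day 30 of 90: still well inside the safe window.
    print(needs_rotation(issued, expires, now=issued + timedelta(days=30)))
    # Day 80 of 90: past the 72-day mark, so rotation is due.
    print(needs_rotation(issued, expires, now=issued + timedelta(days=80)))
```

Without a mesh, a check like this has to run for every service, followed by issuing and distributing the new certificate by hand or with custom tooling.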
and again Istio does that it works great
and it's been evolving
pretty well and now we'll talk about
Linkerd version 2 and it's interesting
for me actually to hear this talk
unfolding and it's historically how
things have been introduced because it
makes a lot of sense for where Linkerd
v2 has gone and so if you look at the
solutions before right we've been
talking about distributed systems
problems and all of the fine grained
features that you need to solve some of
those challenges but I think like we
said in the beginning of the talk not
everybody needs all of that complexity
and so this is where I think Linkerd
version 2 comes in so Linkerd version 2
is a complete rewrite there is no code
from Linkerd v1
in v2 and what v2 was all about was
looking at the lessons that were learned
from running Linkerd in production and
what Buoyant discovered was that a large
portion of customers were not making use
of a lot of that complexity right there
are a couple of common problems that you
have right away no matter how large your
application is when you start using
distributed services and even containers
on a small scale and those problems tend
to be observability rule whoa my fault
sorry those problems tend to be
observability security and performance
right you want this thing to be fast you
don't want it to introduce any sort of
latency right and that's why you see the
proxying components written in Rust
they're very small they're very scalable
they're very fast and you see the
features that are focused on which is
you get service level metrics you can
see what failures are happening which
exact service calls might be failing and
you get a lot of visibility into the
session layer without having to encode
that into your applications the other
thing that you get is automatic TLS right
that certificate management SSL upgrades
right that seems to be a very common
feature that has a lot of value and the
Linkerd v2 use case is centered around
service owners right so it's very easy
to incrementally adopt either by pod
right or by service and you don't have
to deploy it across your entire platform
which means if you're not the only team
using your kubernetes installation right
you can use Linkerd and talk to other
non Linkerd services and you can do
that incrementally and you can do that
easily so it's really centered around
what are the things you probably need
first and how can we make that easy to
use but the downside is because it's
zero config because it just works easily
out of the box if you need a lot of that
deep powerful configuration if you need
features like custom routing right some
of the things that are in the heavier
weight solutions it's not there so it's
easy to get started but maybe not as
powerful as some of the others and
there's a there's an important point
there is that
if you go to for example Istio there
is not so much tooling around Istio
today
you know if you add let's
say conflicting rules in Istio like if
you do like a route rule or a
virtual service rule in Istio that's going
to conflict with a network policy in
kubernetes nothing is going to tell you
that you know and so someone that
controls the network in kubernetes
through network policies may have said
namespace a cannot talk to namespace b
right and then still in your Istio control
plane or in your configuration you're
assuming that service a in namespace a
can talk to service b so that's
why we make the point of you have to be
really cautious about the decision of a
service mesh so Istio moved to v1 a
few months ago right so the tooling to
facilitate investigation of problems in
the usage of the technology itself it's
not there there's not a tool today that
will tell you exactly why a
communication did not go through you
know is it because of network policy
that blocked is it because of a specific
routing policy that was blocked so
all of that you still have to kind of
develop something yourself or do some
manual investigation right of course we
believe that the tooling will get there
right it's just a matter of where we are
right now in the evolution of the
technology right and I think that's an
important point right focusing on where
you are what problems do you need to
solve most and how much complexity is
that worth to you right again all of
these tools have a learning curve
there's complexity in maintaining them
upgrading them understanding what
happens when those tools themselves
go wrong right so again we're just
looking at what are the different
philosophies where does each type of
solution help you and so hopefully you
can use that to figure out which
solution is right for you well then the
question comes up all right well that's
a nice look at those four what about
other service mesh options
and I think what we've seen lately is
the number of additional solutions
whether it's Aspen Mesh or Kong right or
a number of other vendors that are now
playing in the service mesh space and
NGINX is in there as well yeah and
so the idea really is we only have
35 minutes today so we're trying to
cover as much ground as we can we're
looking at specifically what is mostly
used in the cloud native ecosystem but
again I think the things that you should
look for are hints around which problems
they solve and look at does that match
the pain that you are feeling with your
applications today and and with that I
think we want to let's say summarize
this presentation by giving you a list
of questions that you should be asking
when you need or if you need to make a
decision on whether or not service mesh
is for you and which service mesh
technology is for you or even can you
address the
same needs with less technology right
because if you take Istio for example I
mean the default Istio let's say package
comes with Prometheus comes
with Grafana they just added Kiali
so you will be bringing a set of tools
that bring more complexity so that has
to be thought through you know so you
should be asking yourself the problems
that you need to solve today can I solve
them without a service mesh is the
solution sustainable if it is and if you
can then why the extra complexity and
I'm a big believer in service mesh but I
am also a big believer in architectures
that make sense right architectures that
are created in a way that they are
sustainable and scalable and not just
again making this a fashion industry
right I think you
also have some thoughts about that yeah
absolutely and so let me
say this not to be very negative about
not needing a service mesh
right I can tell you if you are using
containers if you are using kubernetes
you definitely have some of these
problems right and one of the first
problems that you have is what's
happening at the session layer right
when I'm making calls to other services
if they fail do I know they're failing
right do I know why they're failing can
I even tell that they're failing and I
think all of the different service mesh
options help you solve at least that
problem right some of the other problems
you might have again around managing
security or consistency or sub
permissions around particular services
you should take a good hard look at what
are the challenges that are facing you
right and which problems hurt the most
based on that apply those to the
different solutions and you might have
an idea of where to go from there
right there are some other tactical ones
like which platforms you need to support
right what functionality you already
have but I think a good one that we
haven't talked about yet is who owns
your services right is that in the hands
of developers do they need to mostly set
permissions around how these services
are configured what they're talking to
and maintain control of that service
communication or is that platform owners
right do you own your kubernetes
installation or does somebody else and I
think one of those questions helps you
figure out at least what type of
configuration you want do you want
everything centrally managed do you want
distributed control over those things
and so hopefully you can start using
this list to answer some of those
questions for yourself I remember that while at
Red Hat I was doing an interview of a
few customers that were interested in
service mesh and they would see the
list of capabilities that a service mesh
delivers and they would get excited and
then I asked are you ready today to have
two different versions of the same
application running so you're interested
in A/B tests does your CI/CD pipeline
support having two different versions of
the same application running at the same
time well the answer was no we're not
ready to do A/B tests right and
yet they were interested in going into
the service mesh without let's say
fixing the CI/CD pipeline first right so
there is a
lot of this needs to be thought through
and this is just one example that came
to my mind that I think it's it's very
important yeah so a lot of a lot of
things look really good on paper but
again it's a question of what are you
really going to use what is critical to
you right now and maybe that can inform
your decision so with that we're gonna
leave some more resources as well
there's a write up that is on here that
is a much longer version of this talk
with a little bit more detail around
those questions and additional places to
probe a couple of introductory books
both to the service mesh and to Istio
there's also a great talk which is four
reasons why you need Istio which is
actually looking more at the distributed
systems problem and so if you have other
questions we'll have a little bit of
time for Q&A
I think still yeah and of course we'll
be around and you can always reach out
to us online so thank you so we have a
few minutes if you have questions you
can ask as I know you know Cisco
announced a network service mesh yes I
Isis that it's just Vista problem was
the
network and Cheney yeah to note actually
he can fist fist connection and and
connection the tendo for the application
for the application ya know for network
application just my opinion what do you
think I mean I think there's there will
have to be integration between SDN
providers whatever SDN provider it
is and service mesh technologies right
because when network policies which is a
great technology was thought through it
kind of a little bit can be
conflicting with service mesh I
think another way of seeing this is who
created the technology right so the
technology created the Istio technology
had a lot of contribution from Google
and Google essentially likes to use what
they call flat network where all
applications
sorry where all computers in the network
should be able to communicate with
all computers but applications not
necessarily so the network access is
available but then there is a
certificate between every single
application that needs to talk so
it's a different way of thinking
about it so the way Google would
think about it is like the network is open
all IPs are available but you actually
protect the application right and the
way network policies were
thought through is more like networks
where you have a firewall and in your
firewall you decide which
services can talk to which
services so I think there's this
integration that still needs to happen
in the technology to make sure that the
rules of your network will be consistent
with the rules of your service mesh and
it's technology it's doable so I
think I think it's not there yet but it
will have to get there you know
customers will need that they already do
need that yeah thanks for the question
yeah hi we're actually building a multi
cluster infrastructure so my question is
how to set up the pod level or service
IP level connectivity with a kubernetes native
solution do you have any clues on that I
think if you want to
introduce service mesh to that there's
a few things that are being thought
through at least in Istio
with zone gateways which is
just like border gateways that have been
used forever you have a border gateway
between the multiple clusters and you
share this space where the services are
registered right so in a multi cluster
if you want one service in a cluster to
talk to a service another cluster
both clusters have to know what services
are available right so the solution has
to be a centralized service discovery
and there are great solutions for that
Consul for example from HashiCorp
it's a great solution for that but then
I think you have to think about doing
ingress and egress
properly and last I checked the Istio group
was doing work on the gateways zone
gateways or multi cluster gateways so to
answer your question in a Linkerd
world there's just a simple proxying
component that you put in front of every
pod right and the interface for that pod
is that is that proxy right it's
transparent to every other endpoint so
as long as you expose it right you can
talk to any other service you can
connect it it's a very simple approach
right but it's one that's very easy to
compose I think we have two minutes we
have time for maybe one more question
this the question was have we tried
Consul Connect I have not tried Consul
Connect yeah I have myself not tried
Consul Connect I sense a
diminished investment in that technology
since so was that a question or a
statement that's kind of a statement
that I'm not so sure about
oh I can I can I because I have very
strong opinions on that well the
question was do we see a world where
service mesh technology will get
consumed into API management solution I
I see the other way around and that's
why I say I have very strong opinions on
that is that if the API management
solutions don't start thinking about it
their business is gonna
go away because rate limiting is
there right so you add a decent
developer portal with billing on top of
a solution that already does rate
limiting that's somewhat of an okay API
management solution right and I say this
because the folks at least that I
know that work on API management
solutions at Google they are the same
kind of group that also works on
service mesh so I think they will
become the same thing or they will cease
to exist or API management will cease to
exist that's a very strong opinion on
that and with that I believe that brings
us to time so thank you very much for
your time can take questions outside
thank you
[Applause]