Navigating the Service Mesh Ecosystem - George Miranda, Buoyant, Inc. & Diogenes Rittori, Pivotal

By CNCF [Cloud Native Computing Foundation]

Summary

## Key takeaways

- **Service Mesh: An Application-Focused Network**: A service mesh functions as an application-focused network, with its features designed to benefit the application rather than the network itself. This is particularly relevant when breaking down monoliths into microservices, entering the realm of distributed systems. [03:12], [03:51]
- **Beware the Fallacies of Distributed Systems**: Ignoring the fallacies of distributed systems, such as unreliable networks, variable latency, and changing topology, can lead to significant problems in application development. These fallacies, identified over fifteen years ago, remain critical considerations. [04:34], [05:53]
- **Service Mesh Adoption: Need vs. Fashion**: Adopting a service mesh should be driven by a genuine need to solve specific problems, not by industry trends. If you can easily name all your services or deploy code infrequently, a service mesh might be unnecessary complexity. [08:10], [09:03]
- **Linkerd v1: Multi-Platform, Heavyweight**: Linkerd v1, the first tool to use the 'service mesh' term, is a battle-tested, multi-platform solution. However, its Scala-based, JVM-dependent architecture results in a significant footprint, making it a heavyweight option. [12:06], [13:54]
- **Envoy: C++, VM-Centric Origins**: Envoy, originally written in C++ for virtual machines, was created to solve integration problems for companies like Lyft. While powerful, its C++ requirement for extensions and its VM-centric origins are key differentiators. [15:10], [16:10]
- **Istio: Feature-Rich but Complex**: Istio offers extensive features and significant engineering effort from companies like Google and IBM. However, its complexity, with over 50 custom resource definitions, means its adoption requires careful consideration of the added management overhead. [16:54], [17:51]

Topics Covered

Service Mesh: An Application-Focused Network
The Fallacies of Distributed Systems Still Apply
Is Service Mesh Really For You? Ask These Questions First
Are You Ready for A/B Testing with Service Mesh?
Service Mesh vs. Network Policies: A Compatibility Challenge

Full Transcript

hello welcome to navigating the service

mesh ecosystem before we get started I

want to introduce the talk a little bit

so we'll get into introductions my name

is George this is Diogenes we are

currently not working for companies that

make products that are in this

presentation but we have both been very

involved in the service mesh ecosystem

either by making contributions or giving

talks writing books or articles about

the service mesh service mesh technology

how to approach it and as a former

engineer one of the things that we've

seen is a lot of interest in the service

mesh and more products being introduced

and the enthusiasm is really good

there's a lot of uptake of service mesh

technology but with all the enthusiasm

comes a lot of questions and a lot of

confusion around which service mesh you

should use and most of the time you see

a presentation from a vendor that is

talking to you about their service mesh

and why that's the right one to use so

we thought we would do something a

little different and give a talk that is

vendor neutral that looks at the

different projects in the ecosystem and

tries to give you as engineers the right

questions to ask to figure out which

solution is right for you there's no one

Universal answer to that and hopefully

this presentation will help you figure

out which approach is right for you

so with that we'll do introductions my

name is George I used to work at Buoyant

they're the makers of the Linkerd

service mesh I was director of community

there I'm now at a company called

PagerDuty

I still do community there all right

PagerDuty users and so I I do a lot

of things like this I work with a lot of

users to help you figure out you know

how to make infrastructure management

better and easier to approach and so

hopefully this talk does a little bit

more of that

Thanks thanks George so my name is Diogenes

Rittori as you can see I'm not the same

person from the picture I have a little

bit more hair than when I had that

picture was taken that's that's just

getting to me

so I've also been involved in the service mesh

space for a while I was able to give

presentations write articles and while

at Red Hat I was a product manager on

the service mesh capability of

OpenShift if you've heard of that so very

excited to be here talking to you I also

want to thank our sponsors the companies

that employ us and pay us the tickets to

get here so thank you so and I'm sure

you're very happy to have PagerDuty

sponsoring you as well here right and

you've seen introductions to service

mesh many times and I think you've seen

like what is the service mesh and George

said like we want to give you a

different perspective on this you know

so we will talk about technical features

of what makes a service mesh but from a

different perspective so this is the

way I explain what a service mesh is

which is an application focused network

right I think the service mesh on

purpose does not have the name network

in it because then you'd have to deal

with network folks and you know it's

always hard right

so it's an application focused network

and the features that the service mesh

deliver they exist to benefit the

application right so they do not exist

necessarily for the benefit of the

network itself but the capabilities that

a service mesh delivers they exist

actually to benefit the application so

I'm going to invest some more time in

addressing the needs of

distributed systems and that's the

important point because once you are

dealing with applications that need to

communicate and once you are doing a

process of breaking a monolith

into individual micro services or into

individual smaller applications if you

are doing that you are in the realm or in

the space of distributed systems right

so I think and I strongly believe that

failing to acknowledge that you are in

this space of distributed systems will

bring you a lot of problems right so and

to present you some of the problems I

love talking about the fallacies of

distributed systems so these fallacies

which means they are all false right were

created more than fifteen years ago when

people were starting to develop

distributed systems and they're still

very valid today so the first one is

that the network is reliable but

then I was thinking that myself as a

software engineer I would always write

code assuming that the network was

reliable that when I'm going to ask my

application to connect or to integrate

with another application I assume that

there will be a reliable network there

and that's not true you know networks

are not reliable networks will go down

you will have a problem at some point in

time and the same can be said about

latency you know so there is latency in

the network it's you don't have infinite

ability to transmit and receive data you

don't have infinite ability to process

data so when you develop applications

and you forget to acknowledge the

fallacies of distributed systems you

will be in trouble right so some of the

others one that I especially find

interesting is that the topology

doesn't change you know so you are

writing an application assuming that

you're going to connect to a certain

database and you're referring to the

database with the IP of the database but

in the development phase it's one IP and

in production it's a different IP so

just in that way the

topology has changed the topology from

when you were developing the application

it's different from the topology from

when you're running the application

right so again acknowledging that the

topology changes is very important you

might think about service discovery then

you might think about externalized

configuration so again it's important

when you're developing applications to

recognize this right
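
A minimal sketch of the externalized-configuration idea mentioned above, assuming the database location is supplied through environment variables. The variable names `DB_HOST` and `DB_PORT` and the default hostname are hypothetical and chosen only for illustration; the talk does not prescribe any particular mechanism.

```python
import os

# Hypothetical defaults for illustration: a DNS name rather than a fixed IP,
# so the address can differ between development and production.
DEFAULT_DB_HOST = "db.internal.example"
DEFAULT_DB_PORT = 5432


def database_address(env=None):
    """Return (host, port) for the database from externalized configuration.

    The mapping is read at call time, so the same code runs unchanged when
    the topology differs between environments.
    """
    env = os.environ if env is None else env
    host = env.get("DB_HOST", DEFAULT_DB_HOST)
    port = int(env.get("DB_PORT", DEFAULT_DB_PORT))
    return host, port
```

The application never embeds an IP address; whoever deploys it decides where the database lives, which is the same separation a service discovery mechanism provides at larger scale.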

also transport cost being zero

very very important and and the service

mesh technologies and the relationship

that exists here it's not let's say a

hundred percent precise you can't

say that they are a hundred

percent one-to-one the relationships

that exist here but I think there is

some sense to that to the relationships

between the fallacies of distributed

computing and the capabilities that a

service mesh delivers right so just

take for example the network is reliable

if I cannot trust on the network then I

will protect myself from that I the

engineering of the application I will

protect and how do I do that I will

think about adding maybe circuit breaking

into my application automated retries

some sort of load balancing so that the

application can try to come to a

trustworthy scenario

without having to do much work right
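
As a rough illustration of what "protecting yourself" from an unreliable network looks like when it lives in application code, here is a minimal sketch of automated retries with backoff and a simple circuit breaker. All class names, function names, and thresholds are hypothetical; a service mesh proxy would typically provide comparable behavior outside the application, and this is not how any particular mesh implements it.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Stop calling a failing service for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time at which the breaker tripped, if any

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; not calling the service")
            # Cool-down elapsed: allow one trial call (half-open state).
            # One more failure will trip the breaker again.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff; re-raise the last error."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Every service that wants this protection has to carry and maintain code like this, in every language it is written in. Moving it into a proxy next to the application is the trade the speakers describe.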

again network is secure we all know the

network is not secure so why don't you

think about making sure that the

communication channels between the

applications are encrypted so and that

leads us to this very important point

which is that service mesh is not for

you right we think and you can probably

go watch another session right there

so you just leave the room right now and

go watch another session right and and

the reason is is that I think we fail to

do the necessary engineer and architect

work of making sense of things so we

fail to really do a proper analysis to

see if we really need to use technology

right maybe this is a fashion industry

but I don't think it should be a fashion

industry I think there should be a need

to use technologies and one example is

can you name the name of all your

services well if you probably know all

the services that you are talking to

that means that the number of services

is not that big right

you know like 20 now you can name 20

services and that's all about what your

application are going to interface with

hmm maybe service mesh is not for you

right maybe even kubernetes is not for

you right if your space is not so big

another one do you know how many times

you deploy code per month oh we do we

do twice a month ok do you think it's

worth the burden extra complexity extra

managerial needs of a service mesh if

you do deployments twice a month it's

important as architects and engineers to

make that consideration right because

again if not we're just in a fashion

industry and I think building on that

right a good question to ask also when

figuring out if the service mesh is

right for you

is when something goes wrong when

there's a failure do you know which

services failed and why is it very clear

when something goes wrong where that

failure occurred or do you have so many

services that the interdependencies are

very difficult to determine if that's

the case right if it's more complex when

something goes wrong you're not really

sure where the problem is

then maybe you actually have a need for

this but if the answer to this is yes

something goes wrong and I know exactly

what failed then maybe the service mesh

is not for you and then the most

important question of all right do you

want to use the service mesh because

it's cool because it's great technology

and you want to tell your friends I am

using a service mesh because it's great

if the answer to that is yes then the

service mesh is probably not for you and

so what we're getting to you right is

that there's some very important work

you need to do here to ask what problems

you are having why do you need a

solution like the service mesh do you

have a lot of complexity

are you in that distributed systems

world right are you running up against

problems like having more than one

administrator having a network that is

not homogeneous right if you are having

those issues then maybe the service

mesh is for you and they stayed so that's

a stage right so you didn't go watch

another session so so let's say that

we've done that work right and we don't

know how many times we've deployed in

the last month right we can't name all

of our services we know that we have a

very complex distributed application

that we are managing well then maybe you

actually have a real problem and then

maybe at that point a service mesh makes

sense for you to use but now right you

know that you need a service mesh it's a

fundamental building block in a cloud

native stack which one do we use and so

that's the point of this talk right the

idea is that we're going to look at the

different options that are available in

the cloud native ecosystem and there are

a number of other service mesh options

we'll talk about those a little bit

later but today we're going to focus on

the products that fall into the service

mesh category that are popular when

you're using kubernetes or in the cloud

native ecosystem we're going to go

through this list historically in order

of what project came along when so we're

gonna start with Linkerd and Linkerd

has two versions right we heard Liz Rice

on stage talk about version two we're

gonna look at Linkerd version one and

the way to think about Linkerd v1

versus v2 is a little bit like Apache

right there's Apache version one and

Apache version two and they have

different problems that they solve they're

different bits of software it's not

just an upgrade from one to the other so

Linkerd v1 has been around for almost

three years

as of February it's tried and true right

we've it's more than a trillion requests

served in production it's battle tested

and used but the way to think about

Linkerd is because it was written three

years ago it was written in a time

before kubernetes was the de facto

standard for container management

platforms right it's back when DC/OS was

still a thing

Docker Swarm was the thing right we thought

like that might work out and so Linkerd

is meant to be multi-platform so the way

to think about Linkerd is if you have

services that you are trying to manage

outside of just kubernetes then this

solution might be for you and when you

look at Linkerd right Linkerd was the

first tool to use the term service mesh

for this category of tool and for this

category of tool right the things that

you would expect are there you know

resiliency features latency aware load

balancing circuit breaking retries

automatic TLS very deep language for

specifying how you do per route

configurations all of those built-ins

are there but here's how to think about

Linkerd version one it's a very

powerful solution but it's also a very

heavyweight solution and so it's written

in scala right which means I need the

JVM to run so the footprint of that can

be pretty big there's been some recent

work to get it working on GraalVM and

make that footprint a lot smaller but

it's still significant so the reasons

that you might go this route are again

if you have a multi-platform use case or

you need that type of complexity you

need all of those heavyweight features

then Linkerd might be for you and might

be a place to start and now you go to

Envoy right I mean we've seen some some

talks on Envoy already Envoy just yesterday or

today was accepted to be the third

project to move out of incubation in CNCF

so I think they'll probably make an

announcement so now that Envoy got

the amount of votes needed to become a

graduated project and what I especially

like about Envoy is that it was it was

not a solution looking for a problem it

was someone Matt Klein at Lyft that he

had a problem on how to integrate

and connect multiple services in a

very performant way and he decided to

create something right so it was created

very much for the use case of Lyft which

is a ride-sharing company like like DiDi

here in China and is it's it's and we go

into the very purpose of this

presentation it's it's important to know

the differences right so for example as

let's say opposed or different than

Linkerd v1 Envoy was written in C++

right so that means that those that want

or need to extend Envoy for for

whatever reason they need to be able to

do that in C++ and I think the reality

is that there are probably more Java engineers

today than there are C++ engineers so

all those considerations have to be

taken into account when you make the

decision of choosing a technology right

also interesting point about Envoy is

that it was created for a world without

containers as well it was created you

know for virtual machines so

the deployment pattern was that each

virtual machine will have one service

and together in the virtual machine the

proxy will sit there so it's like

similar to containers but instead of

like one container per application it

will be like one virtual machine per

application again the project has gained

a lot of popularity it's a it's a great

technology and it was extended it was

extending in many different ways one of

the ways that the project was expanding

is with Istio so the engineers worked

together to create extensible interfaces

in Envoy that allow you to publish new

rules to Envoys in a way that anyone

that wants to do that it's it's it's

facilitated again important point is

that Istio it's not necessarily and

technically a sorry a CNCF project

but I think it's it's fair to introduce

it here right so Istio has a very

very strong development from from Google

from IBM there are other companies such

as Red Hat such as Pivotal they also

participate in the development but it's

it's very interesting to see how much

engineering effort has been put by by

IBM and Google on this right so IBM was

doing a

similar project I think called

Amalgam8 and they decided to unite

those projects and

make them into a single thing and and

the table that I showed earlier pretty

much talks about some of the Istio

features and how they map to

distributed computing

right so Istio has been evolving I

think I think it's fair to say that

Istio is becoming a little bit

complicated if you're just starting now

there are more than if you know

Kubernetes a little bit there are more

than 50 different custom resource

definitions that you can use inside

Istio to configure it to your liking and

again the point right the extra

complexity when dealing with service

mesh has to be worth it right so if

you're really willing to use a

technology that is going to require more

knowledge to manage and probably

different people to manage you have to

make that decision you can't just use

without thinking about that but again

Istio is great so it does a great amount

of work in generating certificates in

rotating and distributing those

certificates for you I mean just

thinking about the distribution of

certificates in a non scenario like this

where you would have to keep doing the

DES manual rotation so that's that's

always a problem

and again Istio does that it works great

and it's it's been it's been evolving

pretty good and and now we'll talk about

Linkerd version 2 and it's interesting

for me actually to hear this talk

unfolding and it's historically how

things have been introduced because it

makes a lot of sense for where Linkerd

v2 has gone and so if you look at the

solutions before right we've been

talking about distributed systems

problems and all of the fine grained

features that you need to solve some of

those challenges but I think like we

said in the beginning of the talk not

everybody needs all of that complexity

and so this is where I think Linkerd

version 2 comes in so Linkerd version 2

is a complete rewrite there is no code

from Linkerd v1

in v2 and what v2 was all about was

looking at the lessons that were learned

from running Linkerd in production and

what Buoyant discovered was that a large

portion of customers were not making use

of a lot of that complexity right there

are a couple of common problems that you

have right away no matter how large your

application is when you start using

distributed services and even containers

on a small scale and those problems tend

to be observability rule whoa my fault

sorry those problems tend to be

observability security and performance

right you want this thing to be fast you

don't want it to introduce any sort of

latency right and that's why you see the

proxying components written in Rust

they're very small they're very scalable

they're very fast and you see the

features that are focused on which is

you get service level metrics you can

see what failures are happening which

exact service calls might be failing and

you get a lot of visibility into the

session layer without having to encode

that into your applications the other

thing that you get automatic TLS right

that certificate management ssl upgrades

right that seems to be a very common

feature that has a lot of value and the

Linkerd v2 use case is centered around

service owners right so it's very easy

to incrementally adopt either by pod

right or by service and you don't have

to deploy it across your entire platform

which means if you're not the only team

using your kubernetes installation right

you can use Linkerd and talk to other

non-Linkerd services and you can do

that incrementally and you can do that

easily so it's really centered around

what are the things you probably need

first and how can we make that easy to

use but the downside is because it's

zero config because it just works easily

out of the box if you need a lot of that

deep powerful configuration if you need

features like custom routing right some

of the things that are in the heavier

weight solutions it's not there so

easy to get started but maybe not as

powerful as some of the others and

there's a there's an important point

there is that the

if you go to adopt for example Istio there

is not so much tooling around Istio

today

you know it's if you do if you add let's

say conflicting rules in Istio like if

you do like a route rule in a

virtual service in Istio that's going

to conflict with a network policy in

kubernetes nothing is going to tell you

that you know and so someone that

controls the network in kubernetes

through network policies may have said

namespace a cannot talk to namespace B

right and then still in your Istio control

plane or in your configuration you're

assuming that service a in namespace a

can talk to service B so so so that's

why we make the point of you have to be

really cautious about the decision of a

service mesh so Istio moved to v1 a

few months ago right so the tooling to

facilitate investigation of problems in

the usage of the technology itself it's

not there there's not a tool today that

will tell you exactly why a

communication did not go through you

know is it because of network policy

that blocked is it because of a specific

routing policy that was blocked so so

all of that you have to kind of

develop something yourself or do some

manual investigation right of course we

believe that the tooling will get there

right it's just a matter of where we are

right now in the evolution of the

technology right and I think that's an

important point right focusing on where

you are what problems do you need to

solve most and how much complexity is

that worth to you right again all of

these tools have a learning curve

there's complexity in maintaining them

upgrading them understanding what

happens when those tools themselves

go wrong right so again we're just

looking at what are the different

philosophies where does each type of

solution help you and so hopefully you

can use that to figure out which

solution is right for you well then the

question comes up all right well that's

a nice look at those four what about

other service mesh options

and I think what we've seen lately is

the number of additional solutions

whether it's Aspen Mesh or Kong right or

a number of other vendors that are now

playing in the service mesh space and

NGINX is there as well yeah and

so the idea really is we only have

35 minutes today so we're trying to

cover as much ground as we can we're

looking at specifically what is mostly

used in the cloud native ecosystem but

again I think the things that you should

look for are hints around which problems

they solve and look at does that match

the pain that you are feeling with your

applications today and and with that I

think we want to let's say summarize

this presentation by giving you a list

of questions that you should be asking

when you need or if you need to make a

decision on whether or not service mesh

is for you and which service mesh

technology is for you or even can you

the same needs or can you address the

same needs with less technology right

because if you take Istio for example I

mean the default Istio let's say package

comes with comes with Prometheus comes

with Grafana they just added Kiali

so you will be bringing a set of tools

that bring more complexity so that has

to be thought through you know so you

should be asking yourself the problems

that you need to solve today can I solve

them without a service mesh is the

solution sustainable if it is and if you

can then why the extra complexity and

I'm a big believer in service mesh but I

am also a big believer in architectures

that make sense right architectures that

are created in a way that they are

sustainable and scalable and not just

again making this a fashion industry

right I think you have some that you

also have some thoughts about yeah

absolutely and so I let's let's let me

say this not to be very negative about

not needing a service mesh

right I can tell you if you are using

containers if you are using kubernetes

you definitely have some of these

problems right and one of the first

problems that you have is what's

happening at the session layer right

when I'm making calls to other services

if they fail do I know they're failing

right do I know why they're failing can

I even tell that they're failing and I

think all of the different service mesh

options help you solve at least that

problem right some of the other problems

you might have again around managing

security or consistency or sub

permissions around particular services

you should take a good hard look at what

are the challenges that are facing you

right and which problems hurt the most

based on that apply those to the

different solutions and you might have

an idea of where to go from there

right there are some other tactical ones

like which platforms you need to support

right what functionality you already

have but I think a good one that we

haven't talked about yet is who owns

your services right is that in the hands

of developers do they need to mostly set

permissions around how these services

are configured what they're talking to

and maintain control of that service

communication or is that platform owners

right do you own your kubernetes

installation or does somebody else and I

think one of those questions helps you

figure out at least what type of

configuration you want do you want

everything centrally managed do you want

distributed control over those things

and so hopefully you can start using

this list to answer some of those

questions for yourself I remember that while at

Red Hat I was doing an interview of a

few customers that were interested in

service mesh and they would see the

list of capabilities that a service mesh

delivers and they would get excited and

then I asked are you ready today to have

two different versions of the same

application running so you're interested

in A/B testing does your CI/CD pipeline

support having two different versions of

the same application running at the same

time well the answer was no we we're not

ready to do A/B tests right so and then

yet they were interested in going into

the service mesh without let's say

fixing the CI/CD pipeline first right so

there is a

lot of this needs to be thought through

and this is just one example that came

to my mind that I think it's it's very

important yeah so a lot of a lot of

things look really good on paper but

again it's a question of what are you

really going to use what is critical to

you right now and maybe that can inform

your decision so with that we're gonna

leave some more resources as well

there's a write up that is on here that

is a much longer version of this talk

with a little bit more detail around

those questions and additional places to

probe a couple of introductory books

both to the service mesh and to Istio

there's also a great talk which is four

reasons why you need Istio which is

actually looking more at the distributed

systems problem and so if you have other

questions we'll have a little bit of

time for Q&A

I think still yeah and of course we'll

be around and you can always reach out

to us online so thank you so we have a

few minutes if you have questions you

can ask as I know you know Cisco

announced a network service mesh yes I

Isis that it's just Vista problem was

the

network and Cheney yeah to note actually

he can fist fist connection and and

connection the tendo for the application

for the application ya know for network

application just my opinion what do you

think I mean I think there's there will

have to be integration between SDN

providers whatever SDN provider it

is and service mesh technologies right

because when network policies which is a

great technology was thought through it

kind of a little bit can be

conflicting with with service mesh I

think another way of seeing this is who

created the technology right so the

technology the Istio technology

had a lot of contribution from Google

and Google essentially likes to use what

they call flat network where all

applications

sorry where all computers in the network

should be able to communicate with

all computers but applications not

necessarily so the network access is

available but then there is a

certificate between every single

application that needs to talk so

it's a it's a different way of thinking

about it so the way Google will

think about is like the network is open

all ip's are available but you actually

protect the application right and the

way network policies was

thought through is more like how networks

are you have a firewall and in your

firewall you decide which

services can talk to which

services so I think there's there's this

integration that still needs to happen

in the technology to make sure that the

rules of your network will be consistent

with the rules of your service mesh and

it's it's technology it's doable so I

think I think it's not there yet but it

will have to get there you know

customers will need that they already do

need that yeah thanks for the question
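
The mismatch described in this answer, and earlier in the talk, between network-level rules and mesh-level rules can be sketched as a simple consistency check. This is only an illustration under stated assumptions: the `Route` type, the `find_conflicts` function, and the flat list of denied namespace pairs are hypothetical stand-ins, not the Kubernetes NetworkPolicy or Istio APIs, and real policies are considerably richer than namespace pairs.

```python
from typing import NamedTuple


class Route(NamedTuple):
    """A hypothetical, simplified view of one mesh routing rule."""
    source_namespace: str
    destination_namespace: str
    destination_service: str


def find_conflicts(denied_pairs, routes):
    """Return the mesh routes that the network layer would block.

    denied_pairs: iterable of (source_namespace, destination_namespace)
    tuples that the network policy forbids.
    """
    denied = set(denied_pairs)
    return [
        route
        for route in routes
        if (route.source_namespace, route.destination_namespace) in denied
    ]
```

For the scenario in the talk, a deny rule for namespace a to namespace b combined with a mesh route from a service in a to service B in b would be reported, which is the kind of feedback the speaker says no tool gave at the time.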

yeah hi we're actually building a multi

cluster infrastructure so my question is

how to set up the pod level or service

IP level connectivity with a few native

solution do you have any clues on that I

think I think if you want to to

introduce service mesh to that there's

a few things that are being thought

through at least in Istio

with zone gateways which is

just like border gateways that have been

used forever you have a border gateway

between the multiple clusters and you

share this space where the services are

registered right so in a multi cluster

if you want one service in a cluster to

talk to a service in another cluster

both clusters have to know what services

are available right so the solution has

to be a centralized service discovery

and there are great solutions for that

Consul for example from HashiCorp

it's a great solution for that but then

I think you have to think about doing

ingress and egress

properly and last I checked the Istio group

was doing work on the gateways zone

gateways or multi-cluster gateways so to

answer your question in a Linkerd

world there's just a simple proxying

component that you put in front of every

pod right and the interface for that pod

is that is that proxy right it's

transparent to every other endpoint so

as long as you expose it right you can

talk to any other service you can

connect it it's a very simple approach

right but it's one that's very easy to

compose I think we have two minutes we

have time for maybe one more question

this the question was have we tried

Consul Connect I have not tried Consul

Connect yeah I have myself not tried

Consul Connect I I have I sense a

diminished investment in that technology

since so was that a question or a

statement that's kind of a statement

that I'm not so sure about

oh I can I can I because I have very

strong opinions on that well the

question was do we see a world where

service mesh technology will get

consumed into API management solution I

I see the other way around and that's

why I say I have very strong opinions on

that is that if the API management

solutions don't start thinking about it

their business is gonna

go away because rate limiting is

there right so you add a decent

developer portal with billing on top of

a solution that already does rate

limiting that's somewhat of an okay API

management solution right and I say this

is because the folks at least that I

know that work on API management

solutions at Google they are the same

kind of group that also work on

service mesh so I think they will

become the same thing or they will cease

to exist API management will cease to

exist that's a very strong opinion on

that and with that I believe that brings

us to time so thank you very much for

your time can take questions outside

thank you

[Applause]
