Principles Of Microservices by Sam Newman
By Devoxx
Summary
Topics Covered
- Model Services Around Business Domains
- Automate Relentlessly for Scale
- Hide Databases to Enable Independent Evolution
- One Service Per Host Prevents Cascades
- Circuit Breakers Isolate Failures
Full Transcript
So, I hope you had an awesome keynote. I got here a bit late, but I hear it involved robots and jetpacks, and one of the guys wiring me up said, good luck following that, which is just what you want to hear before speaking to a large audience. But thanks for coming along today. Unfortunately, the conference looks awesome and I'm literally here just for this talk, because I am bad at scheduling; so hopefully, if my flight takes off from Brussels, depending on what the airline is doing, I'll be shooting off straight from here. My name is Sam Newman. I work for a company called ThoughtWorks; if you don't know what we do, you can email me or look us up on the net, because we're on the internet now. I'm also the author of a book called Building Microservices; if you enjoy the talk, there are copies of the book available at the O'Reilly stand, which you can find later on. But we're actually here to talk about these things, because they are all the rage. The only thing more buzzwordy in 2015 than microservices is, of course, Docker. I'm sure lots of you have Docker stickers on your laptops; I'm sure some of those people with Docker stickers on their laptops have even run Docker. But this is what we're here to talk about, and these are microservices. I draw them as hexagons because it's a nice shape, and they're my slides and I get to pick the shapes, but also as an homage to Alistair Cockburn's paper on hexagonal architecture. They have names with meaning: customer, shipping, inventory. These give you an idea of what the architecture might be about. And this is the definition I use for them: small autonomous services that work together,
modeled around a business domain. I normally say small, independently releasable services that work together, modeled around a business domain. You might have your own definition of microservices, and that's great; you can put that in your own book. But this is what I say they are: separate processes that communicate over a network port. They're really all about independent evolution, about being able to make a change and deploy it into production by itself. I've spent a lot of time working with organizations on this, because I come from a background of having spent a good chunk of the last 10 years of my career working with service-oriented architectures, so I see microservices as nothing more and nothing less than an opinionated form of service-oriented architecture. I was intrigued as to why microservices worked, and what it was that organizations had to do to make these things work well. There are a lot of downsides that come with microservices, a lot of complexity that we add; how do you chart a path through and around the pitfalls and get the valuable stuff out? One of my colleagues, James Lewis, talks about microservices
in the context of an architecture that buys options for you; that is to say, you invest in having these smaller, finer-grained architectures, and in exchange you get the ability to make lots of different choices. Choices can be good, but when we come from a background of working with more monolithic software, we often only get used to making one or two major decisions a year. We have one main technology stack we use for that monolithic system, maybe only one type of persistence store, maybe only one main idiomatic design used in that system. With a microservice architecture you get to make a lot more choices, and this can actually be a source of a large amount of friction. As always when you make decisions, if you approach every single one from scratch, thinking about the pros and cons, it can become a bit exhausting to go through this the whole time; it can also lead to situations where you make different decisions in similar situations, and you end up with a whole load of inconsistencies in your architecture. Quite often we use a set of framing principles to help guide our decision-making, like a set of value statements that decide how we do
things round here. So, for example, Heroku have their 12 Factors. The key thing about Heroku's 12 Factors is that they are principles that guide decision-making when working on the Heroku platform. All these sets of principles exist to achieve some goal: follow this stuff well and hopefully your application will work well on the Heroku platform. It's actually a mix of principles (design decisions) and constraints, the constraints of the Heroku platform itself; but nonetheless, when you're building a system on Heroku, this set of principles guides your decision-making. This next set of principles was put together by a colleague of mine, Evan Bottcher. The things we typically talk about as architectural principles are what you see in this central column, but Evan really highlighted that these principles, these things that drive how we're going to design our software, exist for a reason, and here they exist to drive the company forward. In the leftmost column we've got a description of what the organization is trying to do: this is an organization that's trying to go fast, trying to expand rapidly into new markets. The architectural principles are therefore about going fast; there's much less emphasis on being consistent, and much more emphasis on empowering teams. Then over on the right, Evan has pulled out, as distinct from principles, this idea of practices: the mechanisms by which you implement a principle. He made the observation that where your company is going doesn't change that often, maybe once a year, once every two years. Our architectural principles change a bit more often: we learn stuff, we realize some of ours weren't great, and that modifies them on a slightly more frequent cycle. And then the practices, the actual detail, change quite a bit, because technology changes all the time. But nonetheless, breaking these things apart allowed this largish organization (they now have over 200 developers) to more or less have a good sense of how things are done around here, and to make sure these principles are driving towards an end goal, that end goal being the company being successful. The 12 Factors
for Heroku have an end goal: your application should work on Heroku. So when I was doing my research into microservices, I was thinking about what organizations do in order to achieve their end goal, which is namely to get enough of the good stuff out of microservices for it to be worthwhile. What are the principles we need to follow to build these small autonomous services that work together? I've distilled it down (there's an earlier version of this in the book; this is a newer version) to eight principles. The first is modeling things around a business domain, because we've found that gives us more stable APIs. Embracing a culture of automation, to manage the fact that we've now got a lot more deployable units. Hiding implementation details, to allow one service to evolve independently of another. Decentralizing as much as possible, both decision-making power and architectural design concepts. Deploying independently, probably the most important principle up there: the idea that you can make a change to a service and deploy it into production without having to make changes to anything else. Consumer first: services, it turns out, exist to be called, and maybe we should think about that; thinking outside-in, not inside-out. Isolating failure: making sure the systems we build are not more flaky than their monolithic counterparts, which is very easy to do. And making sure our systems are highly observable: making it easy to understand how they hang together and how they behave. So let's dive into the first principle, modeling things around a business
domain. I said earlier that I draw these things as hexagons, but the more important thing is the names: they have names that have meaning. When you look at the architecture for a microservice system, you should get some idea of the domain in which it operates. Compare that to a lot of the architectures that came out of service-oriented architecture, where people took the horizontal technical layers within a process boundary and said, right, they're going to become new services. We ended up with presentation services, business-logic services, and backend data-storage services. The nice thing about those architectures is that you can use exactly the same architecture diagram for an oil rig, a banking system, or a charity, because it's the same diagram. It's not very useful, though, because in systems that have been split horizontally, a change often has to cut all the way through: something as simple as adding a field to a user interface may require changes in two or three services, and when those services are owned by different teams, that's coordination across teams. With microservices, instead of slicing things horizontally, we're slicing things vertically: the unit of decomposition is effectively the business domain. We've found that services modeled around a business domain are much more stable; the APIs themselves don't tend to change fundamentally that often. Changes across service boundaries are expensive, so we want to avoid them. We also find that by exposing these finer-grained seams, it's easier to create different sorts of user interfaces, because we can recombine the functionality in different ways for a mobile device or a web application. And teams that own these services become experts in that part of the business domain, rather than becoming experts in some arbitrary technical decomposition of the whole: we now get teams that really understand how invoicing works, how the accounts process works. Finding these seams in existing monolithic systems can be difficult, but there's a lot of work from domain-driven design that can help us here. In many ways the same principles of modular decomposition from the 70s still apply, but taken with a healthy dose of domain-driven design thinking as well, helping us look for things like bounded contexts and subdomains, you can actually find the service boundaries. So Implementing Domain-Driven Design is a good place to start if you're interested in using this as a way of understanding the domain you're operating
in. Let's talk about our next principle: embracing a culture of automation. You need to be pretty relentless about this if you're going to use microservices at scale. At the moment you start on this journey, with a small number of services, you can probably get away with manual provisioning of machines and manual deployments; that won't last. Let me talk about a client of ours who had been using microservices for many years, before we even had the word for it: a company called REA in Australia. They'd spent a couple of years investing in a deployment platform on Amazon, primarily to allow them to cheaply provision test and dev environments, and they already had a fairly good degree of rigor and discipline around automation. But they wanted to go a bit further and embrace the Amazon idea of the two-pizza team: services being owned and operated by teams, where the team deploys the infrastructure, deploys the service, manages that service in production, and actually tears the service down when it's no longer needed. From a standing start they got two of these services up into production inside three months, which I think is very good going; most organizations wouldn't turn that around as quickly as they did. That went really well for them, and they thought, right, we're really going to go fast now, we can ramp this up. It took them another nine months just to get seven more services up, because they had to invest all the way along in tooling and in creating a platform that allowed them to do this efficiently. All of this is about reducing the transaction cost of having and managing more services, and it's not always easy to see what you're going to need when you start that journey. So you see fairly flat growth, and then six months later they had 60 services in production. The key thing to understand is that this is 60 different types of service; these services may themselves be further scaled out. So you see this hockey-stick explosion in the growth of services. You see similar growth patterns from Gilt, who have shared their numbers over time; they went from a monolithic Rails application to a decomposed, often JVM-based platform. For many years they had a low double-digit number of services, but once sufficient investment in the platform kicked in, things spiked up. So when we're thinking about automation, we're thinking about things like infrastructure automation: can I write a line of code and provision an isolated operating system, or provision a service? Have I got sufficient testing in place to help me understand whether or not I can release my software, and am I treating every check-in as a release candidate? Have I really got rigor around that stuff? All of this is what you're going to have to invest in if you want to use these things at scale. And be relentless: there will be some upfront work required to get this going, and it will require ongoing investment as well. Let's talk now about one of the
trickiest things to get right, which is hiding implementation details. We're in a very small, cozy environment here, this is a safe space, there's only one or two or 600 of my closest friends, so I feel confident sharing with you the world's most commonly used service integration pattern outside of the internet, and it is this: I have a service that talks to a database. This is okay; I'm fine with this; databases are good things. But then I want to spin up another service, and so I do this. It's very easy to do, it's very quick to do, and it allows two services to share information. This is very common. With two services it's not too bad, but it is quite bad. If I want to make a change to a schema, maybe rename a column because the name is bad, maybe restructure the schema to hit different performance targets, can I do that safely, knowing that other parties are reaching in and looking at my database? The answer is that I can't. Effectively, when you expose a database to another service in this way, you have exposed internal implementation details; you don't get to decide what is shared and what is hidden. With two services things aren't too bad. I worked on a platform where we had 40 separate services integrated on a schema that we owned; we couldn't track down who all those people were, so we had to turn the database off during the day and wait for the phone calls. This drastically impacts your ability to change and evolve the design of your systems. So this is what we want instead: if you want to get information from another service, or you want to change data it holds, you make a request in some way; you make an API call, or you send it a message. At the API layer, the people owning that service get to decide what is hidden and what is not, which allows them to change the internals of that system safely. So hide your databases: this is one of the biggest things to get right in allowing these services to evolve independently.
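The shared-database pattern versus an API boundary can be sketched in a few lines. This is an illustrative sketch, not from the talk; the service name, the fields, and the in-memory "database" are all assumptions standing in for a real schema and a real network API:

```python
# Sketch: the service owns its database; consumers go through the API.
# Names (InventoryService, sku, warehouse_row) are illustrative assumptions.

class InventoryService:
    def __init__(self):
        # Internal schema: private to this service, free to be renamed
        # or restructured without telling anyone.
        self._db = {"sku-1": {"qty_on_hand": 5, "warehouse_row": "B2"}}

    def stock_level(self, sku: str) -> dict:
        # Public API: the owning team decides what is shared.
        record = self._db[sku]
        return {"sku": sku, "available": record["qty_on_hand"]}


inventory = InventoryService()
print(inventory.stock_level("sku-1"))
# The internal 'warehouse_row' detail never leaks across the boundary,
# so the schema behind it can change safely.
```

If a second service read `_db` directly instead, renaming `qty_on_hand` would silently break it; going through `stock_level` keeps that decision inside the owning team.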
But even once we've got a nice API boundary, we still have to think about what that API shares as well. I mentioned some of the ideas behind domain-driven design earlier; one thing we talk about with service design and domain-driven design is this idea of the bounded context. A bounded context is an explicit boundary within a domain: you have models that you share between those boundaries, and you have models which only really need to exist inside one of those boundaries. This is an example from Martin Fowler's post on bounded contexts. On the left we have a collection of functionality around sales, with concepts like territory, pipeline, and opportunity. On the right we have a bunch of stuff about support: defects, product versions, tickets. The nice thing about having diagrams like this, again, is that you get a sense of what the domain might be, which is quite useful. But there are two things that are shared: customer and product. The thing to understand is that what customer means inside a sales context is different from what customer means inside a support context, even though it might be the same person. A customer in sales is somebody I have sold to or might sell to; a customer in the context of support is somebody who's raised a ticket. So when you're sharing information, you've got to really understand: what do I actually need to share? What is the information that anyone else actually cares about? Imagine these were two service boundaries and think about how I might implement this. I have an object which is the customer; it has fields, maybe the tickets they've raised and the defects, all there as little fields in the object. I run my serializer on it to transform it into a highly efficient and very human-readable format like JSON (slight troll), and it runs and follows all the references and creates this nice big JSON payload, and I send that over the wire, and along with it go the tickets, and the defects that person has raised, and so on and so forth. When you expose internals like that, again it becomes very, very hard to change. Exposing information is costly: it's easier to expose information you'd previously hidden than to hide information you'd previously exposed. So you also need to think very carefully about what is shared and what is hidden; that's often what a lot of the bounded-context ideas are about.
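The difference between running a serializer over the whole internal object graph and sharing an explicit boundary model can be sketched like this; the Customer fields here are illustrative assumptions, not from the talk:

```python
# Sketch: share an explicit, minimal model at the boundary rather than
# serializing the whole internal object. Field names are illustrative.
import json
from dataclasses import dataclass, asdict


@dataclass
class Customer:
    # Internal model inside the support context.
    id: int
    name: str
    tickets: list   # internal detail: consumers shouldn't depend on these
    defects: list


def to_shared_customer(c: Customer) -> str:
    # Explicit boundary model: only what other contexts actually need.
    return json.dumps({"id": c.id, "name": c.name})


c = Customer(1, "Ada", tickets=["T-1"], defects=["D-9"])
payload = to_shared_customer(c)
# The naive alternative, json.dumps(asdict(c)), would drag tickets and
# defects over the wire and make them painful to change later.
```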
Let's talk now about maybe one of the fuzzier ideas here, and that's decentralizing all the things. The reason this is important is that microservices are an architecture which optimizes for autonomy: autonomy of teams, predominantly, rather than individuals, but autonomy nonetheless. To achieve that end goal of going faster, of deploying more quickly into production, you have to actually push power outwards. The definition of autonomy I use in this context is giving people as much freedom as possible to do the job at hand. So we need to think: what can we do to make the teams owning these services more in control of their own destiny? It starts with things like self-service: do I have to raise a ticket to get a machine or provision an environment, or can I just do it myself? That's a very simple thing. Governance is also important; I actually think governance is not necessarily a dirty word. Have a place where people can collectively come together, look at the cross-cutting concerns, understand whether our principles need to change, but find a way for that governance process to be shared as well. Rather than having some centralized architect who sits over the whole thing, you have members of the teams come together and talk and share ideas. Some organizations create this with things like shared communities of practice; the slide here is referencing a blog post from Gilt from a couple of years ago. The structure they talk about didn't really stick for very long, but nonetheless it's an interesting example of how you can have a more collective sense of governance and ownership. But decentralization also comes into our architectures. How many people have an architecture like this? Hands up, anybody. A nice, simple, magical bus that manages the communication of all our services, and it looks like a nice diagram. The problem, of course, is that it's often hiding a lot of problems. I have nothing against message brokers, nothing against things that get messages from A to B and do so in a resilient and reliable way; a large amount of my IT career has been spent using such things. But I don't like it when those message brokers start taking on more and more functionality and more and more behavior. IBM MQSeries was a good queue in 1995, but they kept adding things on top: we make these buses domain-aware, we use them to implement consistent data models, we put more and more smarts into this wonderful Magical Mystery bus in the middle, and suddenly, to make a change, we need to change not just the service but also the message bus itself, which is now managed by a separate team. If you're going to use messaging middleware, keep it dumb: keep it about the pipe, about going from A to B, and keep the smarts in the services. And this doesn't just apply to messaging middleware. If you look at the current trend around API gateways, they are fast becoming the enterprise service bus of the microservice era, because when we look inside these things, they look nice on the surface, but there's some hellish landscape of death and destruction lying just beneath the surface.
So, we're at the halfway point now. Let's talk about probably the most important principle, and that is deploying independently: the idea that it should be the norm, not the exception, that you can make a change to a service and deploy it into production without changing anything else. If you have five services right now, and you always have to deploy all five of those services together, fix that before you add a sixth; you'll thank me later. Getting this right can require a lot of things, but it often starts with simple things, like how your services are mapped to the underlying infrastructure: how many services per host do you have? When I say host here, I really mean an isolated operating system and collection of resources; that could be a physical machine, a virtual machine, or a container. We have the model where I have one service per host, or the model over on the right where I have multiple services per host. Multiple services per host is the world you'll be in if, say, you're using a Java application container: this is where you're using JBoss, this is where you're using IIS. It's often an approach that optimizes for having a small number of hosts; it's the world you'll be in if the cost of provisioning a new host is too high, if you only have physical infrastructure, or if you have to raise lots of tickets to provision a virtual machine. The issue is that the world on the right is a world of side effects. That world on the right is where I deploy a service with a bug that uses up all the CPU on the machine, and suddenly all the other services stop working. It's where I deploy some prerequisites that a service needs on that machine, and suddenly those prerequisites clash with the other services on that box, and those other services stop working. The world on the right is more confusing to think about from an operations point of view, and it doesn't really help us towards independence. You don't have to start with one service per host, but virtually everybody I've met who uses microservices at scale, where by at scale I mean more than one microservice per developer, ends up on the left, because it's a simpler world; it's much easier to reason about. This is partly why people are so excited by Docker and things like it: it lowers the cost of creating isolated operating
environments like this. But we also have to think about making changes: we want to avoid breaking other services. When I make a change to a service and deploy it into production, the key thing I'm asking myself is: have I broken one of my consumers? That's often why people resort to releasing all their services together. They say, I've tested these 10 services together, I know they work together, so I'll just release them all at once, and that process becomes enshrined as the way to do things. But that actually slows down how quickly you can get functionality out, and makes for riskier deployments. If I want to make a change to one service, say in this example the inventory service, the key thing to understand is: have I broken my consumer? If I deploy a new version of inventory, is shipping still going to work in production? There's a way we can actually validate that before deployment, without having to do large end-to-end testing, and that's a technique called consumer-driven contracts. If you think about this communication, the shipping service has expectations of how the inventory service is going to behave. The issue is that those expectations are often only implicitly modeled: there are calls in the application code that we could look through and distill down into the contract we have, but that contract isn't explicit anywhere. With consumer-driven contracts we make that contract explicit, and we make it executable. The consumer team, in this example, would create a set of tests that represent the expectations they have of the inventory service; those tests are then run as part of the CI build of the inventory team. Every time I check in, I run my consumer-driven contract tests: maybe I bring up the inventory service on my CI node and execute the expectations against it for the various different consumers I've got, and if one of them breaks, I know not only that I shouldn't go into production, but exactly which consumer I've broken. This is a very good technique, and we use it quite a lot now. There are some test tools that can be jury-rigged to support this, but there are a few concerns that are quite tricky in this area, so the tool I like a lot in this space is one called Pact, which is built from the ground up for this purpose. Beth Skurrie runs the project, and there's even a companion project called Pact Broker where you can store the expectations for multiple different versions, which means you can validate the expectations of multiple different versions of the same consumer before going into production, which is something you often want to do. It's well worth a look. This allows you to do independent, isolated testing of a service, validate that you're not going to break consumers, and go into production without the need for big end-to-end testing.
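A minimal sketch of the consumer-driven contract idea (real tools such as Pact do this properly, with recorded interactions and a broker for storing them); every name here is an illustrative assumption:

```python
# Sketch of a consumer-driven contract: the shipping team writes down its
# expectations of the inventory API as executable checks, and the inventory
# team runs them on every check-in. All names are illustrative.

def inventory_service(sku):
    # Stands in for the inventory team's current implementation under test.
    return {"sku": sku, "available": 5}


def shipping_contract(call):
    # The shipping team's expectations, made explicit and executable.
    response = call("sku-1")
    assert "sku" in response, "shipping reads the sku field"
    assert isinstance(response["available"], int), "shipping needs an integer count"


# Run in the inventory service's CI build: if this fails, the inventory team
# knows before deployment exactly which consumer they would have broken.
shipping_contract(inventory_service)
print("shipping contract satisfied")
```

If the inventory team renamed `available`, this build step fails on their CI node, long before shipping ever sees the change in production.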
But the problem is that sometimes you do actually need to break consumers. You don't want to, but sometimes you have to. The key thing here is that if we want to embrace independent deployability, we can't force consumers to upgrade at the same moment we produce a new version of our service API, so we have to think about different models. One model I like a lot is coexisting endpoints. I'm going to introduce a breaking change, so what I do is map existing API calls to my version-one endpoint; this could be a different namespace, or with RPC maybe even a different port. That's where my old traffic is going. I put my breaking API up as a new version, version two, and I expose it somewhere else, cleanly identifiable. I then give my consumers time to upgrade; once they've made the switch, which could be a separate release a few weeks from now, I can retire the old endpoint. I've used this model quite a few times; at one point I even had three different APIs exposed on one service to allow consumers time to upgrade. This works very well in terms of keeping your deployments quite simple and keeping service discovery simple, and it works well when you've got some control over your consumers: the ability to ask them, at some point, to upgrade to a new version. Another model you can use when you introduce a breaking change is to produce a brand-new version of your customer service, so maybe I've got version one and version two running, serving different consumers. That model works well when you can't change the consumers; they just need that API. The problem with having multiple different versions of a service live at once is that those versions are effectively branches in code: if I now have to fix a critical bug, I may have to fix it in multiple places. It can also complicate service discovery: I now need to find not just my customer service but a particular version of the customer service, and if these services are also stateful, that can be a bit tricky. But nonetheless, a mix of coexisting endpoints and running multiple versions of the same service are ways in which you can break an API without breaking your consumers. In a way this is a version of the expand/contract pattern: once nobody's using an old version of my service, I turn it off; once nobody's using an old version of my API, I remove that code.
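The coexisting-endpoints model can be sketched as a routing table; the route names and payload shapes are illustrative assumptions, not from the talk:

```python
# Sketch: coexisting endpoints during a breaking change (expand/contract).
# The /v1 route keeps serving old consumers while /v2 exposes the new shape.
# Routes and payloads are illustrative.

def handle_v1(customer_id):
    # Old shape: a single 'name' field.
    return {"id": customer_id, "name": "Ada Lovelace"}


def handle_v2(customer_id):
    # Breaking change: name split into two fields.
    return {"id": customer_id, "first_name": "Ada", "last_name": "Lovelace"}


routes = {
    "/v1/customers": handle_v1,  # retire once every consumer has migrated
    "/v2/customers": handle_v2,
}

assert routes["/v1/customers"](1)["name"] == "Ada Lovelace"
assert routes["/v2/customers"](1)["first_name"] == "Ada"
```

Both shapes are served by the same deployed service, so deployment and service discovery stay simple; removing the `/v1` entry is the "contract" step once nobody calls it.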
Let's talk now about putting the consumer first. Services exist to be called. With a user interface, I suspect most of us are now quite comfortable with the idea that it's a good idea to have a real user, or a fake user, look at our design and help us tweak and iterate it. Some of you may even have done guerrilla testing: going out there, watching people use your application, filming them while they're doing it, getting that great feedback. APIs are the same. APIs are a user interface; their user is just another team, another set of developers. So you need to think: what do I need to do to make it easy for them to work with my service? Services exist to be called. It's very unsexy, but one of the easiest things you can do to make your life easy is have good documentation. Swagger is winning, if it hasn't already won, the battle in this space as a way of defining documentation for APIs. Most web API frameworks you'll use will support exposing the JSON for this stuff, and it can often be a very easy thing to do: you put a little bit of information on your endpoints and you can produce nice, shiny documentation. Swagger can go a bit further for you, because you can use the Swagger UI to actually execute those endpoints from within your browser. As a person consuming an API, you want to explore it, you want to understand how the payloads work. Being able to go to a Swagger UI like this, see the documentation, see example templates of what to execute, actually paste them in, change the fields, hit execute, and maybe run against a developer version of that service: that's great feedback for someone writing a service to consume your API.
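A minimal sketch of what a Swagger (OpenAPI) description looks like, built here as a plain dict; the endpoint and fields are illustrative assumptions, and in practice a framework usually generates this from annotations on your endpoints:

```python
# Sketch: a minimal OpenAPI 3.0 document describing one endpoint.
# The path, summary, and parameter are illustrative assumptions.
import json

spec = {
    "openapi": "3.0.0",
    "info": {"title": "Inventory API", "version": "1.0.0"},
    "paths": {
        "/stock/{sku}": {
            "get": {
                "summary": "Current stock level for a SKU",
                "parameters": [
                    {
                        "name": "sku",
                        "in": "path",
                        "required": True,
                        "schema": {"type": "string"},
                    }
                ],
                "responses": {"200": {"description": "Stock level"}},
            }
        }
    },
}

# Serve this JSON from your service and point Swagger UI at it to get
# browsable, executable documentation for your consumers.
print(json.dumps(spec, indent=2))
```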
Other things can help too. Even knowing what's actually running out there can be useful. Many of you may have heard of service discovery tools. I don't tend to like using service discovery tools very early on, because I think they're really about scale, but nonetheless these sorts of systems give you information about what is running where. I tend to favor Consul in this space. It's really designed around having one machine talk to another machine, not necessarily very useful for human beings, but it does expose information that can be useful: as a consumer I can now get hold of what's running where. Then all you need to do is a little bit of work to get that information out and present it in a nice fashion. A colleague of mine in Australia coined the term "humane registry": a registry designed not for other machines but for human beings. He started off with a wiki page. If you've got information and documentation about your service via Swagger, and actual runtime, dynamic information about your services held in something like a service discovery system, just create a wiki page for your service and pull that information into one place. As a consumer, I go to that page and can maybe even find out weird things like who I should email when it doesn't work. I've been in a few organizations where you don't even know who created this thing, and that's quite scary, believe me. But there I could see the documentation, see how it's running, maybe even see some stats. Creating things for human beings is quite important in the microservice world, and we'll touch on this idea of making things easier to understand in a moment.
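The humane-registry idea above can be sketched as a small function that merges static documentation with dynamic runtime information (as you might pull from a discovery tool like Consul) into one human-readable page. All the service names, hosts, and contacts here are made up:

```python
# A sketch of the "humane registry": pull static documentation and
# dynamic runtime information into one page a human can read.
# Every value below is illustrative, not from a real system.

def humane_registry_page(name, docs_url, owner_email, instances):
    """Render one service's registry entry as plain text."""
    lines = [
        f"# {name}",
        f"Docs: {docs_url}",
        f"Contact: {owner_email}",
        "Running instances:",
    ]
    for host, port in instances:
        lines.append(f"  - {host}:{port}")
    return "\n".join(lines)

page = humane_registry_page(
    name="customer-service",
    docs_url="https://wiki.example.com/customer-service",  # hypothetical
    owner_email="customer-team@example.com",               # hypothetical
    instances=[("10.0.1.5", 8080), ("10.0.1.6", 8080)],    # hypothetical
)
print(page)
```

The point is less the code than the habit: one page per service answering "what is it, where does it run, who do I email when it breaks".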
Let's talk now about isolating failure. There's an unfortunate misunderstanding of distributed systems that some people have: they assume that just by breaking a set of functionality up across multiple machines, their systems will automatically be more resilient. That's not true; it's actually much easier to make things less resilient. Think about it: if your application is running across more machines, and machines have a failure rate, there are now more machines in your system that could fail. There are more network boundaries, more networks that could partition or time out. You've effectively expanded your surface area of failure, and so unless you've also built your application to handle that failure, your system will be less resilient. True story: a couple of years ago I was working at a client who had taken a monolithic .NET application, split it up into 12 pieces, and run it in production, and they said to me, "Sam, whenever one of these services stops working, everything stops working." A friend of mine has likened this to a distributed single point of failure. Somebody else described it as taking your brain, chopping it into 12 pieces, and putting it into 12 different jars. My suggestion was to merge it all back together, because that would probably be more resilient.
The issue is they hadn't thought about what failure means. Now, it's not just other people who do this; I did it too. I was the lead on a project a few years ago for a classified ads website. They worked in multiple verticals: you could buy a guitar and a cement mixer from them. They'd built up all these old legacy applications to support different verticals over time, and we were working to move them onto a new technology stack. We used a pretty common migration pattern, actually a pattern you use a lot if you're looking to move towards microservices: the idea of a strangler application. Effectively, a strangler application is something that intercepts calls to the old system and potentially redirects them to the new code, and over time you get rid of the old code until only your new code exists. So in this example we were proxying requests to the downstream applications.
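The heart of a strangler application is one routing decision: has this request been migrated yet? A minimal sketch, with invented path prefixes standing in for the migrated verticals:

```python
# A sketch of the strangler-application routing decision: requests
# whose paths have been migrated go to the new stack; everything
# else still proxies through to the legacy system. The prefixes
# below are illustrative, not from the talk.

MIGRATED_PREFIXES = ["/cars", "/guitars"]

def route(path):
    """Return which backend should handle this request."""
    for prefix in MIGRATED_PREFIXES:
        if path.startswith(prefix):
            return "new-stack"
    return "legacy"

print(route("/guitars/fender-123"))   # a migrated vertical
print(route("/cement-mixers/42"))     # still on the old system
```

Over time you grow `MIGRATED_PREFIXES` until nothing routes to `legacy`, at which point the old system can be switched off.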
And that was fine. For the verticals that got less traffic and were less valuable to the organization, we were leaving those alone and focusing on where the money was. We had round about 10 production nodes. Normally we would have only about 30 to 60 concurrent requests per node at any given point in time at peak, but we'd have about 6 to 8,000 requests per second coming in; most of that was very aggressively cached, so those 30 to 60 were effectively the by-design cache misses. The peak was during the day, which was very good for me, because during one of these peaks the whole system went down. What happened was that the load on these nodes went from handling 30 to 60 concurrent requests to handling over 800 in the space of 15 minutes, and when you've got one request equaling one thread, you get some idea of what might happen to circa-2009-era hardware.
So the whole system went down, and went down very quickly. It turned out this was an example of a cascading failure, the kind of thing you really need to protect yourself against. What was happening was that one of the downstream services was failing in the most annoying way anything can fail in a distributed system: it was failing slowly. When things fail slowly they tie up resources, and in a distributed system they have the potential to tie up resources across whole call chains, so you have multiple services that may have resources locked up. That's dangerous, and it's what actually took our system down. Because this thing was failing slowly, the thread pool we were using to proxy calls became exhausted: all the threads were blocked waiting for it to time out. The thread pool therefore had no more workers available, which was annoying because, although the rest of the downstream applications were working just fine, no traffic could get through. And because the thread pool was full up, all the requests coming in from the outside world kept building up, blocked and hanging there waiting. Those requests coming in at the top were what caused the huge spike in the number of concurrent requests, and it took the whole system down. So this was just one of our applications, a very old one, that one day decided to be slow, and it cascaded up and took out our entire system in 15 minutes. That's not good. We fixed this in a few different ways, and these ended up being, although I didn't realize it at the time, a fairly common set of patterns that you'll use to make these systems more resilient.
The first thing we did was recognize that the timeouts were hopelessly wrong. We were waiting 2 minutes for these downstream services to respond. No human being, even in 2009, waits 2 minutes for a web page to load, so people who had requested a page had already gone off and done something else while we were still waiting on something to time out. So we really shortened those timeouts: we took them from 2 minutes to about 2 seconds. The way you do this is to look at the normal response time percentiles, so we stuck our timeouts in a fairly healthy place where the 90th-percentile response was fine. We accepted that we might be aggressively timing things out, but the reality was that the application that had started failing on us generated a very small amount of our traffic anyway, so we felt it was more important to keep the whole system running. That's the first thing: when you're thinking about timeouts, ask what they currently are and what they should be.
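The approach of deriving a timeout from observed percentiles rather than guessing can be sketched like this; the latency samples are invented, and the nearest-rank percentile method and the 1.5x headroom factor are illustrative choices, not from the talk:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

# Invented response-time samples, in milliseconds. Note the slow
# tail: one request took two full seconds.
latencies_ms = [80, 95, 110, 120, 130, 150, 160, 180, 200, 2000]

p90 = percentile(latencies_ms, 90)
timeout_ms = p90 * 1.5  # some headroom over the 90th percentile
print(p90, timeout_ms)
```

With a timeout of 300 ms here, the 2-second outlier gets cut off deliberately: as in the story above, you accept aggressively timing out the slow tail to keep the rest of the system healthy.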
We brought those right down, but we realized that even then we still had a single point of failure in that thread pool. Even if we could put those threads back quickly, with only one thread pool for all of our downstream services, we still had the situation where traffic to one downstream application could stop traffic going to the others. So we added one thread pool per downstream application. This is an example of what's called bulkheading, a very important pattern in resilience engineering. The way to think about it: you've got a big ship, you hit a rock, and water starts pouring into the hull. You go down into the hold and close off the compartment that's flooded; that compartment is flooded, but the rest of the ship carries on. In this situation, if one of those thread pools becomes exhausted, the other thread pools can still serve requests to the other downstream applications.
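A minimal sketch of that bulkheading fix, using a separate bounded thread pool per downstream (the downstream names and the stand-in request function are invented):

```python
# Bulkheading sketch: one bounded thread pool per downstream
# service, so exhausting one pool cannot block calls to the others.
from concurrent.futures import ThreadPoolExecutor

pools = {
    "cars": ThreadPoolExecutor(max_workers=5),      # illustrative sizes
    "guitars": ThreadPoolExecutor(max_workers=5),
}

def call_downstream(service, request_fn, *args):
    """Submit a call on the pool belonging to that downstream only."""
    return pools[service].submit(request_fn, *args)

# Stand-in for a real HTTP call to the "guitars" downstream.
future = call_downstream("guitars", lambda sku: f"stock for {sku}", "g-42")
print(future.result())

for pool in pools.values():
    pool.shutdown()
```

If every worker in the `guitars` pool ends up blocked on a slow downstream, calls routed through the `cars` pool are unaffected: the flooded compartment is sealed off.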
That's good. The third thing we did was add what are called circuit breakers. Circuit breakers work in a network sense just like they work in your house: a surge of electricity comes into your house, the circuit breakers open, they stop the flow, and they protect your appliances. Here, the way the circuit breakers work is that after a certain number of errors or timeouts, the circuit breaker opens and requests stop getting sent to the downstream service. That gives the downstream service the ability to recover if needed, especially if you've got, say, exponential backoff and retries, but it also allows your code to fail fast rather than waiting for a timeout or an error: I can say "the inventory service is down". That allows you to programmatically degrade functionality. In our case, that meant that when a circuit breaker blew open, we would actually close off the part of our user interface that related to that vertical, or pop up an error message. So now, not only are you keeping the rest of the site running, you're keeping information flowing: you're giving a clear indication to the user of what's happening.
Circuit breakers are also useful not just for handling unplanned outages; they're good for handling planned ones too. It's like the fuses in your house: before you start drilling into the walls, you open the circuit breakers to stop yourself electrocuting yourself while you're drilling. So when we needed to deploy a new version of one of these downstream applications, we'd flick the circuit breaker open, the site would degrade functionality based on that service no longer being available, and we'd deploy the new version, test it, and then reset the circuit breaker. By putting in something that was there to deal with unplanned outages, we also gave ourselves a mechanism to handle near zero-downtime deployments.
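The behavior described, opening after repeated failures, failing fast while open, and supporting a manual trip and reset for planned maintenance, can be sketched in a few lines. This is deliberately minimal; production libraries add half-open probing, timers, and metrics on top of the same idea:

```python
# A minimal circuit-breaker sketch: after a threshold of failures
# the breaker opens and calls fail fast. It can also be tripped by
# hand before a planned deploy, as described above.

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def trip(self):           # manual open, e.g. before a deploy
        self.open = True

    def reset(self):          # manual close once the service is back
        self.open = False
        self.failures = 0

    def call(self, fn, *args):
        if self.open:
            raise CircuitOpenError("failing fast: downstream is off")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True
            raise
        self.failures = 0     # any success resets the failure count
        return result

breaker = CircuitBreaker(failure_threshold=2)

def flaky():
    raise RuntimeError("downstream timed out")

for _ in range(2):
    try:
        breaker.call(flaky)
    except RuntimeError:
        pass

print(breaker.open)  # the breaker has opened after repeated failures
```

Once `breaker.open` is true, callers get an immediate `CircuitOpenError` instead of waiting on a timeout, which is exactly the hook for programmatically degrading the UI.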
All three of these patterns are helpfully described in Michael Nygard's book Release It!. If you buy one book this year, buy my book; if you buy two books this year, buy my book and Mike's book. It's a really excellent book on resilience engineering. This is the stuff you have to think about; these three patterns will come up time and time again, but you have to ask: what happens if every single thing I depend on fails? I would apply circuit breakers around database connections as well. There are good libraries for this: you've got Hystrix for Java, you've got Polly and Brighter for .NET, and there are probably about 15 different implementations in Ruby, some of which may even work, so just do your research on that one. But do read Mike's book as well. On to the last principle now, that of making things highly observable. I don't mean just in the sense of making it really easy for you to look at all the machines you've got running.
This is what I used to do when I ran production systems: I'd have lots of xterms open, lots of green-on-black text. It was great; I'd have top running on all my machines like I was in The Matrix. It felt wonderful: look what I'm doing, I'm supporting the systems. And if I wanted to check the logs for errors, every day or two I'd log on to all the machines and do a grep for errors to see if any odd patterns were coming up. That's fine when you've got, say, six machines. Then we got more machines and I started to have a real problem; I couldn't manage them anymore, so I got a second monitor so I could have more windows open. That doesn't scale so well. After a certain point you really need to move away from the idea that monitoring, observing, and understanding system behavior is about logging into the machine. You need to be thinking about gathering all the information you can out of those nodes and storing it in one place where it can be viewed.
We're talking really about aggregation of all the things. Start off by getting the logs out of the system. If you've got money, buy Splunk; if you've got a lot of money, buy Splunk, because it's an awesome, fantastic log aggregation tool. If you want something to host yourself, the ELK stack is great; if you want something off-premise, you can use Papertrail or Sumo Logic. Just get all of your logs in one place. It makes it very, very easy to see what's happening across your entire fleet, and most of those aggregation tools will also do things like reporting on error rates, which can be really useful. Do the same thing with your stats: get things like the response rates off every single one of your machines, so you can look at latency as well, across your circuit breakers too. Get those stats off those nodes and somewhere central. Traditionally you'd do something like Graphite; if you're hosting it yourself, I'd go with Prometheus nowadays. Again, New Relic, AppDynamics, any of those systems handle aggregation of stats in a really nice fashion.
With stats aggregation, what you're often looking for is the ability to see things over time, so you want a good time-series-based system, and the ability to drill down within the aggregation: you want to see the overall pattern, but when you want to dive into what a particular service or a particular machine is doing, you need to be able to navigate in, so these systems will often come with some kind of query language to make that possible. It's not just about aggregation, though; we've also got to think about how services are connected to each other. We've got to make it easy to understand what's happening and how our systems are behaving. For example, say I've got some interconnected services: I click a button, which calls a service, which calls a service, which calls a service, and deep in that call stack I get an error. As an application developer, I might have enough information at the service itself about the call that caused the error, but will I understand the context in which that call happened? Will I understand all the other things that led up to that error? What if it happened as part of a long-lived business transaction? I've now got to work out what's broken as a result: do I need to unpick something manually?
Here we're just reusing an old idea from event-based systems: the correlation ID. When I start some action, say I click a button, I generate an ID, and that ID flows downstream: for every subsequent call, I record that correlation ID and some information about it. Some people use tools like Zipkin, which gives you tracing of latency across calls. I actually like taking these correlation IDs and putting them into log files, because then when I get a stack trace, I see the correlation ID, I put it into my log aggregation system, and I can see all the log statements related to all the calls through my entire call stack. This stuff is really, really useful. The unfortunate thing about correlation IDs is that by the time you've got a system complicated enough to need them, the effort to put them in is non-trivial, because by definition you've got a complicated system. So I often start off by saying: we're just going to come up with a convention and put them in. It's going to be a header; here's how we generate them; here's where we expect to find them in the logs. Even starting that off with a simple system is going to be useful.
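That convention can be sketched like this: mint an ID at the edge, pass it downstream in a header, and prefix every log line with it. The header name `X-Correlation-Id` and the in-memory log are illustrative assumptions; the header name is a common choice, not a standard:

```python
# A sketch of the correlation-ID convention: generate an ID at the
# edge of the system, pass it downstream in a header, and include
# it in every log line so aggregated logs can be searched per call.
import uuid

LOG = []  # stand-in for a real log aggregation system

def log(correlation_id, message):
    LOG.append(f"correlation_id={correlation_id} {message}")

def handle_checkout(headers):
    # Reuse an incoming ID, or mint a fresh one at the edge.
    cid = headers.get("X-Correlation-Id") or str(uuid.uuid4())
    log(cid, "checkout started")
    call_payment_service({"X-Correlation-Id": cid})
    return cid

def call_payment_service(headers):
    # Downstream services log against the same propagated ID.
    log(headers["X-Correlation-Id"], "payment requested")

cid = handle_checkout({})
print(all(line.startswith(f"correlation_id={cid}") for line in LOG))
```

With log lines tagged this way, a single grep (or log-aggregation query) for one correlation ID pulls back the whole call chain.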
One of my clients, a very overachieving sort of person, came up with a really nice idea. We were talking about this idea of logs as data, and he wrote a little program that, given a correlation ID, would actually draw pictures of the services and how they were communicating. There's having a piece of documentation that says this service talks to that service, and then there's drawing pictures of it based on correlation IDs coming out of your live production logs. That stuff is just really useful.
So let's summarize; let's talk about our eight principles again. Modeling things around a business domain leads to services that have more stable API boundaries; it's easier to reuse them in different ways for different user interfaces; it makes the teams that own them experts not just in the domain but in the services themselves; and it avoids lots of cross-cutting changes. Embracing a culture of automation is key to letting you manage all these different services flying around. Hiding implementation details is essential if you want to evolve the internals of one service without breaking others. Decentralize things: avoid smart middleware, see if you can push decision-making into your teams, and lower the barriers to entry for teams to look after and manage things themselves. Deploy things independently; this is actually the golden rule. If there's only one thing you remember from this presentation, it's actually to buy my book, but if there are two things, it's this: you need to be able to make a change to a service and deploy it into production in isolation from everything else, and if you can do that reliably, you're in a very good place. Put your consumer first; it's a very soft thing, but services exist to be called, so think outside-in. Make sure you understand where your sources of failure are: every single communication between one service and another is a potential place where something can go wrong, so plan for that, understand it, and know what you're going to do about it. And finally, make sure you build your systems to be observable: building in things like correlation IDs and aggregating stuff in a consistent, standard way is very important. If you want more information about the book, you can find it at buildingmicroservices.com; there are a bunch of copies I signed that you can buy, and there's 40% off. You can also find links to my blog, where I'm blogging about new patterns and new research I've done subsequent to this. But thank you very much for your time.
I think we've got some time for questions. I can't see anything, but there's a question over there.

"You talked about slicing things vertically. When you do that, what about the front end? Does each service own its front end, and if so, how do you integrate those? Or do you integrate on the front end itself, in one application?"

Yep, it's a really good question: if you're slicing vertically, what do you do about the user interface? The challenge is that user interfaces are normally fundamentally aggregations of functionality. I've seen a few different models. For organizations who are primarily delivering over the web, building old-school, non-single-page-app websites, there's a very easy way to do it: each service serves up a collection of pages, and you use a very thin scaffolding layer to pull that stuff together. That's sort of what Gilt do, and what REA do: effectively, I work in this area, this is our part of the UI, and then you have someone keeping an eye on it all coming together correctly at the top. Orbitz actually use microservices to serve up components in their pages that are then pulled together. Things get tricky around mobile. With mobile you often can't just make loads of calls to these backend microservices, because that's very expensive in terms of battery, data plans, and the like. You could do it with a single-page app, which I've also seen, where you make the calls straight into the services, but for mobile that's often not efficient. So there's a pattern called Backends for Frontends: you effectively have an edge service that handles the server-side communication for a particular user interface. You make coarse-grained calls to that backend for frontend, and it in turn makes the calls to the microservices. The key thing with that pattern is that the BFF is tightly coupled to that user interface and often owned by that team, if they have a mobile team. So REA, for example, have one BFF for their Android app and a different BFF for their iOS application. I've got a blog post on that coming out, I think next week, a big piece that should go into a bit more detail. I hope that was useful.
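The BFF idea described here is essentially server-side fan-out behind one coarse-grained call. A minimal sketch, where the downstream functions are invented stand-ins for real microservice calls:

```python
# A sketch of Backends for Frontends: the mobile BFF exposes one
# coarse-grained endpoint and fans out to the fine-grained
# microservices server-side, so the phone makes a single request.
# The downstream functions below are stand-ins for real HTTP calls.

def get_customer(customer_id):
    return {"id": customer_id, "name": "Alice"}       # stand-in data

def get_orders(customer_id):
    return [{"order_id": 1}, {"order_id": 2}]         # stand-in data

def mobile_home_screen(customer_id):
    """Single coarse-grained BFF endpoint for the mobile app."""
    return {
        "customer": get_customer(customer_id),
        "recent_orders": get_orders(customer_id),
    }

screen = mobile_home_screen("c-7")
print(len(screen["recent_orders"]))
```

Because each BFF serves exactly one user interface, it can shape this aggregate payload for that screen, which is why it is usually owned by the team that owns the app.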
Cool, there's a question here. "If you have, for example, a customer service, and it's used by two other services which, as you showed, are not related to each other and have different needs of the customer, do you propose to make endpoints for those different services on that customer service?"

Yeah, so the question was: if I've got a customer service and two different services that need different things from the customer, would the idea be that one stores some set of information about a customer and another stores other information about the customer? Something like the customer is a fantastic example of what often doesn't make sense as a service in its own right, and I use the example a lot in the book, which is where it gets confusing. I don't think it makes sense for the customer service to necessarily store all information about the customer, because if you follow the links, that could end up being all of your data. The way I like to think about it: let's think of the British government. I'm always thinking about the British government; it's part of my patriotic duty. If you go to the DVLA, the driving licence authority, they store information about me: my car registration, my driving licence. HM Revenue and Customs stores information about my tax returns, which by the way are late, and the NHS stores information about my medical health records. Now, they all have very different needs, and I actually quite like the idea that that information isn't all in one place. What they're working through now is the idea that I still have an identity, so effectively the information stored about me is federated across those different places. For something like a customer service in a very simple system, I might be inclined to keep most of the information there, but over time I think a customer service is actually going to store a very small amount of information about me: maybe just enough to handle authentication, maybe just enough about my identity. Then those local services that have their own needs of your data might hold a pointer to your identity, but they'll store their own local records about you. I think that's a very natural progression you reach when you go beyond more trivial systems. The customer, your user, is always a great example of where this comes up. I hope that was useful. I've probably got time for one more question.
"A question about data: what's the proper way to replicate or sync it?"

So, how do you handle shared data? I think very clearly about who owns a piece of information. If I'm copying data around, I use caching for that, so I have information about how often I will refresh it. I don't like copying other people's data and storing it in my database, because then it's not clear who owns what. So I would just put cache headers on the resources I'm sending out, allow services to make decisions about whether or not they cache and re-request that information, and that's how I would handle it. If you really, really need things to be consistent, then you can't cache, in which case you have to go back to a consistent data source anyway.
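The caching approach described, where the owning service sets a `Cache-Control` max-age and consumers decide whether their copy is still fresh, can be sketched like this (the header parsing is deliberately naive, for illustration only):

```python
# A sketch of cache-header-driven sharing: the owning service sets
# Cache-Control max-age on what it sends out; consumers check
# freshness instead of permanently copying the data into their own
# database. Timestamps are plain epoch seconds for simplicity.
import time

def is_fresh(fetched_at, cache_control, now=None):
    """True if a cached response is still within its max-age."""
    now = time.time() if now is None else now
    max_age = int(cache_control.split("max-age=")[1])  # naive parse
    return (now - fetched_at) <= max_age

# Fetched at t=1000 with a 60-second max-age (invented values):
print(is_fresh(fetched_at=1000, cache_control="max-age=60", now=1030))
print(is_fresh(fetched_at=1000, cache_control="max-age=60", now=2000))
```

When the copy goes stale, the consumer re-requests it from the owning service; ownership stays clear because the consumer never treats its copy as authoritative.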
So that's sort of your trade-off. I'm short on time, so I can't really explain that very well here, but I do talk about it in the book. Anyway, I've got Christmas presents to buy. Okay, thank you very much for your time. If you want to ping me questions, I'll try to follow up on my Twitter handle, @samnewman, but I've now got to get to Brussels and hopefully onto a Lufthansa flight to Munich, so wish me luck.