What we've learned from running thousands of production RabbitMQ clusters - Lovisa Johansson
By RabbitMQ Summit
Summary
Topics Covered
- Minimize Connection and Channel Count
- Separate Publisher and Consumer Connections
- Keep Queues Short to Avoid Overload
- Tune Prefetch for Even Consumer Load
- HiPE Boosts Throughput at Startup Cost
Full Transcript
I fell down the rabbit hole in 2014, when I started to work at CloudAMQP, where we provide RabbitMQ as a service. My goal when I started at CloudAMQP was to make it easier for everyone to get started with RabbitMQ. I wanted to bring simple use cases to the public and point out all the benefits of using RabbitMQ, so I started to write technical documentation, update code examples, and write blog posts, more blog posts, and tutorials. I answer questions on Stack Overflow, and I also created two ebooks. So today, when I'm at a bar having a beer or when I talk to unfamiliar people, I ironically call myself an influencer - and the life of a blogger and influencer of this kind is almost as glamorous as you might think: you get spoiled with all kinds of luxury and you get to hang out with important people. And I don't know if you can see it, but that's a dog trying to disturb me, because I work from home a lot. During my time at CloudAMQP I have replied to more than 3,000 emails about RabbitMQ, and I have, as Anya said, been part of the urgent support team, where we respond to urgent support issues - when a client needs attention directly, when a server is running out of memory, or when RabbitMQ is under heavy load. With all this writing, many things were documented. I met up with different customers to see how they are using RabbitMQ, and I wrote down lots of common use cases. We collected patterns and errors that we see in setups: configuration mistakes and all sorts of common mistakes -
things that can go wrong and things that work out well. So what kind of issues are we dealing with? First of all, we have client-side problems, where users like you and me, or client libraries, are using RabbitMQ in a bad way, and situations where things are just not done in an optimal way. And then of course we have the server side: servers running old versions, misconfigured servers, or setups that are not configured for the selected use case. So today I will spread the knowledge and talk about what we have learned from running thousands of RabbitMQ nodes. But before we go into all the things I have already written down all over the internet:
my name is Lovisa and I'm from Sweden. I work at 84codes, which is the provider of CloudAMQP, as marketing manager, support engineer, and lots in between. I have a growing family of lovely colleagues, and many of them are with me here today; they are always happy to talk, so come by and talk to us at our booth downstairs. 84codes is also the provider of three other services: CloudKarafka, which is Apache Kafka as a service; ElephantSQL, which is PostgreSQL as a service; and the hosted message broker for IoT, which is named CloudMQTT. We work remotely from all over the world, and we also have customers from all over the world, but we have our headquarters in Stockholm. We are today the largest provider of managed RabbitMQ servers, with tens of thousands of running instances in seven clouds and 75 regions. I will now give some recommendations, where "some" equals 16, and I will also give a summary of all these recommendations at the end of this presentation. I know I will repeat lots of things that have already been said today, but I did not really have
time to rewrite my slides. Recommendation number one: try to keep the connection and channel count low. Each connection uses about 100 kB of RAM - and even more if TLS is used - so thousands of connections can be a heavy burden on a RabbitMQ server, especially if you're running on a small instance. And believe it or not, connection and channel leaks are among the most common errors that we see, but luckily they are mostly unintentional. Number two: make sure you don't open and close connections or channels repeatedly. Doing that gives you higher latency, as more TCP packets have to be sent and received, and the handshake process of an AMQP connection is, as mentioned before, quite involved: it requires at least seven TCP packets, and again even more if TLS is used. RabbitMQ is optimized for long-lived connections, so keep connections open and reuse them if you are able to. Channels can be opened and closed more frequently, but even channels should be long-lived if possible; it is best not to open a new channel every time you publish. Each process should ideally create only one TCP connection and use multiple channels within that connection for its different threads. We deal with servers that are under heavy load due to the opening and closing of connections almost every week, because some clients can't keep a long-lived connection to the server.
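As a back-of-the-envelope sketch of why this matters, here is a toy calculation using the talk's rough figure of ~100 kB of broker RAM per connection (the process and thread counts are made-up examples, and channels are ignored because they are far cheaper than connections):

```python
# Toy estimate of broker RAM held by AMQP connections alone.
# ~100 kB per connection is the rough figure from the talk; channels
# are much cheaper, so only connections are counted here.
CONNECTION_OVERHEAD_KB = 100

def connection_ram_kb(processes: int, threads_per_process: int,
                      connection_per_thread: bool) -> int:
    """RAM in kB consumed on the broker by a fleet of worker processes."""
    connections = processes * (threads_per_process if connection_per_thread else 1)
    return connections * CONNECTION_OVERHEAD_KB

# 50 worker processes with 10 threads each:
# one connection per thread -> 500 connections; one per process -> 50.
naive = connection_ram_kb(50, 10, connection_per_thread=True)
shared = connection_ram_kb(50, 10, connection_per_thread=False)
print(naive, shared)  # 50000 5000 (kB) - a tenfold difference
```

The same arithmetic is why the one-connection-per-process, one-channel-per-thread pattern is the recommended shape for client applications.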
And this has, as I said before, an impact on latency. One way to avoid connection churn is to use a proxy that pools connections and channels for reuse, and we have developed an AMQP proxy for this. Our benchmarks show that the proxy increases publishing speed by a magnitude or more, and there is a link to the GitHub repo in the slides. We develop in many different languages, and this proxy is written in Crystal, which we are also a really proud sponsor of. Number three: always separate connections for publishers and consumers.
First of all, imagine what happens if you use the same connection for the publisher and the consumer when the connection is in flow control. A flow-controlled connection is a connection that is being blocked and unblocked several times per second in order to keep the rate of messages at a speed the rest of the server can handle. A publisher and consumer on the same connection might worsen this flow control, since you might not be able to consume messages while the connection is blocked. Secondly, RabbitMQ can apply back pressure on the TCP connection when the publisher is sending too many messages to the server, and if you consume on the same TCP connection, the server might not receive the message acknowledgements from the client. So consumer performance is affected too, and with lower consumer speed the server might be overwhelmed after a while. If you know something about Sweden - apart from Swedish fika - it might be that we
love queueing. What we don't like is large queues, or when people somehow try to squeeze in ahead of you in the line. So recommendation number four comes straight from my heart: whatever your use case, try to keep your queues as short as possible. A message published to an empty queue goes straight out to the consumer as soon as the queue receives it (and a persistent message is of course also written to disk). It's recommended to have fewer than 10,000 messages in a queue. Many messages in a queue put a heavy load on RAM usage, and in order to free up RAM, RabbitMQ starts flushing - paging out - messages to disk. This paging process usually takes time and blocks the queue from processing messages when there are many messages to page out. Another bad thing about large queues is that it's time-consuming to restart a cluster with many messages, since the index has to be rebuilt, and it's also time-consuming to sync messages between nodes in the cluster. Large queues are also a very, very common error among our customers: a queue just piles up due to missing consumers, or because clients are publishing messages faster than the consumers are able to handle them, and eventually the server is overloaded and killed. When this happens we usually add more power to those machines, but it still takes time to restart the cluster, again because of the rebuilding of the index, et cetera. For applications that often get hit by spikes of messages, and where throughput is more important than anything else, it's sometimes recommended to set a max-length on the queue, because this keeps the queue short by discarding messages from the head of the queue.
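That head-drop behavior can be mimicked with a bounded deque - a toy model, not RabbitMQ itself, but `collections.deque` with `maxlen` discards from the opposite end on append, which matches the default drop-head overflow of a max-length queue:

```python
from collections import deque

# Toy model of a queue with a max-length policy: when the queue is
# full, the oldest message (the head) is discarded to make room for
# the newest one.
q = deque(maxlen=3)
for msg in ["m1", "m2", "m3", "m4", "m5"]:
    q.append(msg)

print(list(q))  # ['m3', 'm4', 'm5'] - m1 and m2 were dropped from the head
```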
Recommendation number five: lazy queues, a feature that arrived in RabbitMQ 3.6. A lazy queue writes messages immediately to disk, spreading the work out over time instead of taking the risk of a performance hit somewhere down the road. Messages are only loaded into memory when they are needed, and thereby RAM usage is minimized, but throughput will be lower with lazy queues. So lazy queues give you more predictable, smooth performance without sudden drops, but at the cost of a little overhead. If you're sending many messages at once - for example if you're processing batch jobs - or if you think your consumers will not always keep up with the speed of the publishers, then we recommend enabling lazy queues. You can ignore lazy queues if you require high performance, or if you know that your queues will always stay short due to a max-length policy or something like that.
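Lazy mode is usually enabled via a policy rather than per-queue arguments; a sketch with `rabbitmqctl` (the policy name and the queue-name pattern here are made-up examples):

```shell
# Switch all queues whose names start with "bulk." to lazy mode.
# Policy name "lazy-bulk" and the "^bulk\." pattern are illustrative.
rabbitmqctl set_policy lazy-bulk "^bulk\." '{"queue-mode":"lazy"}' --apply-to queues
```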
Number six: the RabbitMQ management interface collects and calculates metrics for every queue, connection, and channel in the cluster, and setting the management plugin's rates_mode to detailed can have a serious performance impact. It should not be used in production if you have thousands upon thousands of active queues or consumers. Recommendation number seven:
queues are single-threaded in RabbitMQ, and one queue can handle up to around 50,000 messages per second. You will get better performance if you split your queues over different cores and nodes, and route messages between multiple queues. Queues are bound to the node where they are first declared, so all messages routed to a specific queue will end up on the node where that queue resides. You can of course manually split queues evenly between nodes, but the downside is that you need to remember where each queue is located. We recommend two plugins that can help you if you have multiple nodes, or a single-node cluster with multiple cores: the consistent hash exchange plugin and rabbitmq-sharding. The consistent hash exchange plugin has been mentioned a lot today; it allows you to use an exchange to load-balance messages between queues, so messages sent to the exchange are consistently and equally distributed across the many bound queues. It can quickly become hard to do this manually without adding too much information about the number of queues and their bindings into the publishers - and note that it's important to consume from all queues bound to the exchange when using this plugin. The sharding plugin, rabbitmq-sharding, does the partitioning of queues automatically for you: once you have defined an exchange as sharded, the supporting queues are automatically created on every cluster node and messages are sharded across them. Sharding shows one queue to the consumer, but it could be many queues running behind it in the background.
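The load-balancing idea can be sketched with a toy router. This is a simplification - the plugin actually uses a consistent-hash ring with binding weights, while this sketch just hashes the routing key modulo the queue count, and the queue names are made up - but it shows why equal-weight queues end up with a roughly even share:

```python
import hashlib

# Toy router: spread messages across queues by hashing the routing key.
QUEUES = ["shard.0", "shard.1", "shard.2", "shard.3"]

def route(routing_key: str) -> str:
    digest = hashlib.sha256(routing_key.encode()).hexdigest()
    return QUEUES[int(digest, 16) % len(QUEUES)]

counts = {q: 0 for q in QUEUES}
for i in range(10_000):
    counts[route(f"order-{i}")] += 1
print(counts)  # each shard receives roughly a quarter of the messages
```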
Then we have priorities. Queues can have zero or more priority levels, and behind the scenes a backing queue is created for each level, so each priority level uses an internal queue on the Erlang virtual machine, which takes up some resources. Using 255 or even thousands of priorities means resource usage similar to having close to that many queues; in most use cases it's sufficient to have no more than five priority levels. This is fixed in RabbitMQ 3.7.6: the max priority cap for queues is now enforced at 255, and applications that rely on a higher number of priorities will break - such applications must be updated to use no more than 255 priorities. Two or maybe three weeks ago we had a case where every restart took a really long time and memory usage just exploded, as you can see there - and this was despite few queues and few messages, which is otherwise the common pattern when a broker takes a long time to restart. This time it was due to many priority levels.
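The delivery behavior behind priorities can be modeled with a toy min-heap (not RabbitMQ internals; the message bodies are made up): higher-priority messages go first, FIFO within a level - and remember that in RabbitMQ each level costs an internal backing queue, which is why a handful of levels is plenty:

```python
import heapq

# Toy model of priority delivery order: higher priority first,
# FIFO within a priority level.
messages = [(5, "urgent-1"), (1, "bulk-1"), (3, "normal-1"),
            (5, "urgent-2"), (1, "bulk-2")]

heap = []
for seq, (priority, body) in enumerate(messages):
    # negate priority because heapq is a min-heap; seq preserves FIFO
    heapq.heappush(heap, (-priority, seq, body))

order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
print(order)  # ['urgent-1', 'urgent-2', 'normal-1', 'bulk-1', 'bulk-2']
```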
Everyone needs to be prepared for broker restarts, broken hardware, or server crashes, and to ensure that messages and broker definitions survive restarts, we need to ensure that they are on disk. Messages, exchanges, and queues that are not durable and persistent are lost during a broker restart, so make sure that your queues are declared as durable and that messages are sent with delivery mode persistent. But remember that persistent messages are heavier, as they have to be written to disk; for high performance it's better to use transient messages and temporary or
non-durable queues. Then we have the prefetch value, which is used to specify how many messages are sent to the consumer and cached by the RabbitMQ client library - how many messages the client can receive before acknowledging a message. It's used to get as much out of your consumers as possible. RabbitMQ's default prefetch setting gives clients an unlimited buffer, meaning that RabbitMQ by default sends as many messages as it can to any consumer that looks ready to accept them, and the messages are cached by the client library in the consumer until they have been processed. So not setting a prefetch can lead to clients running out of memory, and it makes it impossible to scale out with more consumers. In RabbitMQ 3.7 we got a new option to adjust the default prefetch value, and this value will by default be set to 1,000 on all new CloudAMQP servers once we roll out 3.7 or higher - it doesn't really exist yet, but soon. A too-small prefetch count may hurt performance, since the client spends most of its time waiting for permission to receive more messages. The figure here illustrates long idling time: in the example we have a prefetch setting of 1, which means that RabbitMQ won't send out the next message until the round trip completes - deliver, process, acknowledge. In this image we have a total round-trip time of 125 milliseconds with a processing time of only 5 milliseconds, so a too-low value will keep the consumer idling a lot, since it needs to wait for messages to arrive. A large prefetch count, on the other hand, could deliver lots of messages to one single consumer and keep that consumer busy while other consumers are held in an idling state - in this image we have one client that has a lot to do and one that is just waiting with nothing to do. So: if you have a single consumer, or a few consumers, that process messages quickly, we recommend prefetching many messages at once and trying to keep your clients as busy as possible. If you have about the same processing time all the time, and the network behavior remains the same, you can take the total round-trip time divided by the processing time on the client for each message to get an estimated prefetch value. If you have many consumers and a short processing time, we recommend a lower prefetch value than for a single or a few consumers. And finally, if you have many consumers and/or a long processing time, we recommend setting the prefetch count to 1, so that messages are evenly distributed among all your workers. One thing you should remember, though: if your client auto-acks messages, the prefetch value has no effect.
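That rule of thumb can be written down directly (a sketch; the 125 ms / 5 ms numbers are the example from the talk):

```python
# Rule of thumb: estimated prefetch for a single fast consumer is the
# total round-trip time divided by the per-message processing time.
def estimate_prefetch(round_trip_ms: float, processing_ms: float) -> int:
    return max(1, round(round_trip_ms / processing_ms))

# The talk's figure: 125 ms round trip, 5 ms processing.
print(estimate_prefetch(125, 5))  # 25 - enough to keep the consumer busy

# When processing dominates the round trip, fall back to a prefetch of 1.
print(estimate_prefetch(50, 100))  # 1
```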
HiPE increases server throughput at the cost of increased startup time. When you enable HiPE, RabbitMQ is compiled at startup, and throughput increases by 20 to 80 percent according to benchmark tests. The drawback of HiPE is that startup time increases quite a lot, to one to three minutes, and it is therefore not recommended if you require high availability, due to this long startup time. We don't consider HiPE experimental any longer: 6 percent of our clusters have HiPE enabled, and we haven't seen issues with it for a really long time. Acknowledgements let the server
and client know whether a message has to be retransmitted. The client can either ack a message when it receives it, or when it has completely processed it, so pay attention to where in your consumer logic you are acknowledging messages. A consuming application that receives essential messages should not acknowledge them until it has finished whatever it needs to do with them, so that unprocessed messages don't go missing in case of worker crashes, exceptions, and so on. An acknowledgement has a performance impact, so for the fastest possible throughput, manual acks can be disabled. Publish confirms are the same thing but for publishing: the server acks when it has received a message from the publisher. Publish confirms also have a performance impact; however, keep in mind that they are required if the publisher needs messages to be processed at least once.
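The point about ack placement can be illustrated with a toy simulation - a made-up model of broker state, not a client library. Acking on receipt loses the message if the worker crashes mid-job; acking after processing leaves it un-acked and therefore redeliverable:

```python
# Toy model of ack placement (broker state as a plain dict).
def consume(ack_before_processing: bool, crash: bool):
    unacked = {"tag1": "job"}  # broker side: delivered, awaiting ack
    done = []
    if ack_before_processing:
        unacked.pop("tag1")    # ack immediately on receipt
    if crash:
        # worker dies: whatever is still un-acked can be redelivered
        return done, list(unacked.values())
    done.append("job")         # ... actual processing happens here ...
    unacked.pop("tag1", None)  # ack only after successful processing
    return done, list(unacked.values())

print(consume(ack_before_processing=True, crash=True))   # ([], []) - job lost
print(consume(ack_before_processing=False, crash=True))  # ([], ['job']) - redelivered
```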
Thanks to the RabbitMQ team, great improvements are made all the time, which is really great. In 3.7 we got the configurable default prefetch, as I just mentioned, and this will probably completely remove the cases where a consumer has been killed by too-large message deliveries due to an unlimited prefetch value. Also in 3.7, per-vhost message stores are now available, and this helps us at CloudAMQP a lot, because we have subscription plans - shared plans - that give a customer a single vhost on a multi-tenant server. This means that our shared plans can be even more stable, and even if you have multiple vhosts on a dedicated plan, it will also be more stable. RabbitMQ 3.6 introduced the lazy queues feature, which gave many of our
customers a more predictable and stable cluster, and we found the lazy queues feature so good that all our new clusters with RabbitMQ version 3.6 or later have lazy queues enabled by default. Many of our customers with issues are running old versions, or undocumented, unstable versions. 3.6 had many memory problems up to version 3.6.14, and 3.5.7 was good but lacks some nice features, like lazy queues - and we still have lots of servers running 3.5. This image shows the RabbitMQ version distribution among CloudAMQP customers, and it's nice to see that so many have upgraded to 3.7, because it's always something we try to push: we want our customers to run new, stable versions, and we always test new versions before we make them available in CloudAMQP. The version that is pre-selected in the drop-down menu, on the page where you select which version you want to run - that is the version we recommend at the moment. So this recommendation is: stay up to date with what is happening in RabbitMQ, use a stable version, and also a stable Erlang and a stable client
library version. Some plugins might be super nice to have, but on the other hand they might consume a lot of resources, and therefore they are not recommended on production servers - so make sure to disable plugins that you are not using. An example of a plugin that we use a lot, but that we disable every time we are finished with it, is the top plugin, which we use when we are troubleshooting RabbitMQ servers for our
customers. Number 15: even unused queues take up some resources - queue index, management statistics, and so on - and leaving temporary queues behind can eventually cause RabbitMQ to run out of memory. So make sure that you don't leave unused queues behind, and set temporary queues to auto-delete, make them exclusive, or give them an expiry (TTL). Many of our customers create custom vhosts and then forget to add an HA policy to the new vhost, which causes message loss during net splits. We have an HA policy on all our clusters, even single-node clusters, because we use it when customers are upgrading to new versions, when they want to change from a two-node cluster to a three-node cluster, when they want to upgrade RabbitMQ versions, and so on.
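An HA policy can be applied with `rabbitmqctl`; a sketch mirroring every queue to all nodes (the policy name is illustrative, and `ha-mode: all` is the simplest choice, not necessarily the right one for every cluster):

```shell
# Mirror all queues across all nodes in the cluster.
# Policy name "ha-all" and the catch-all ".*" pattern are illustrative.
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}' --apply-to queues
```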
Here is a summary of it all:

- Keep your queues short
- Use long-lived connections
- Limit your use of priority queues
- Use multiple queues and consumers, and split your queues over different cores
- Use a stable Erlang and RabbitMQ version, and a stable client library version
- Disable plugins you are not using
- Have channels on all your connections, and separate connections for publishers and consumers
- Don't set the management statistics rate mode to detailed in production
- Delete unused queues, and set temporary queues to auto-delete

For those of you who are interested in recommendations for high performance, this is even more important:

- Short queues, and use of max-length if possible
- Do not use lazy queues
- Send transient messages
- Disable manual acks and publish confirms
- Avoid multiple nodes
- Enable RabbitMQ HiPE

And for those who are more interested in high availability:

- Enable lazy queues
- Have two nodes, and don't forget the HA policy
- Use persistent messages and durable queues
- Do not enable HiPE

And the last one: we have
created a diagnostic tool, available from the CloudAMQP control panel, where customers can validate their RabbitMQ setup and get a score for it. It's been used by many customers, and it's nice to have: when we get a support request, we can check it first and then get back to the customer and say, you need to fix this, this, and this - and then the server is usually running much better after that. Here are examples of things that are validated by this diagnostic tool; I think I've talked about many of them, but not all of them, so just come down and talk to us if you want to see it. As we have seen, best-practice recommendations are different for different use cases, and some
applications require high throughput, while other applications publish batch jobs that can be delayed for a while, and other applications just need to have lots of connections. Trade-offs have to be made between performance, guaranteed message delivery, and so on. Our customers are today able to select the number of nodes when they create a cluster: a single node for high performance, and two or three nodes mainly for high availability and/or consistency. We also have lots of other features built into the CloudAMQP control panel, like the option to configure alarms for queue length or for missing consumers, and users can view how many messages there have been in a queue over time, which helps us a lot when we are troubleshooting servers, since statistics for the queues are available all the time. We also show metrics for usage, like CPU, RAM, and disk. We have seen many different use cases, and there are future plans at CloudAMQP to make it even easier for customers to quickly set up a cluster specified for a selected use case, based on best-practice recommendations. This is my final slide, and it would be nice if we could have a list like this in the community - a list of recommendations - because it makes it so much easier for beginners to start using RabbitMQ. So if you have any recommendations of things that we need to add, or if you have different opinions about something, just let me know or reach out to me.
Thanks! Perfect, thank you. Let's get started with some questions - do you want to join me for questions? This is our lead developer, and he's also the one who should take a lot of the credit for the diagnostic tool that we have.
Q: Hi - which public cloud providers do you use to serve RabbitMQ, and how do you scale each cluster up and down? Are you using Docker containers or EC2-style virtual machines, and which metrics do you use to scale cluster nodes up and down?

A: The customer can select the data center - that was the first part, right? They select the data center when they create a cluster, and they can choose between Amazon, Rackspace, IBM Cloud, Alibaba Cloud - all of them. And for the other part of the question, how we bootstrap: we don't use any Docker. We just use the cloud providers' different APIs to spin up instances, and we have all our bootstrapping in Bash scripts - no fancy container stuff.

Q: And which metrics do you use for the scale-up and scale-down of cluster nodes?

A: Basically the same as for bootstrapping: we use the cloud providers' APIs to spin up instances, bootstrap them with our custom scripts, and add them to the RabbitMQ cluster, then remove the smaller old nodes - so rolling: adding new nodes, removing old ones.

Q: Hi, one question: are you really doing everything just with Bash scripts? Nothing like Kubernetes or Docker or BOSH behind this?

A: No - Bash, just Bash.

Q: Okay, thank you - and thank you for the talk, that was awesome.
I wonder if you have a strategy to help customers keep up to date - do you handle part of the upgrades, do you provide tools, or anything else?

A: In our control panel we have a simple button where you press upgrade, and we send out information like: now there's a new RabbitMQ version, and it's good because of this, this, and this. Whenever we can, we do the upgrade without downtime - if it's a patch upgrade we do it node by node - but if it requires downtime, we notify you beforehand.

Q: On the slide about high performance you had a bird, and on the slide
about HA you had something like a mound - what is that mound?

A: It's an anthill, because it's many ants - and the bird is flying high. That's the reason. It's not something I've been thinking about a lot.

Q: Thank you for the talk. Would you say that RabbitMQ 3.7 is
more stable than 3.6.14 and later? Please be honest.

A: Is RabbitMQ 3.7 more stable than 3.6? I would say yes. With the early versions of 3.6 we had a lot of problems; the current 3.6 version and all the 3.7 versions have been working really well, and lazy queues was a really good feature for us.

Any more questions? Yes -
Q: Final question: what's the downside of using HiPE with HA?

A: The downside of using HiPE is mainly that when a node has to come back online after a net split or something, it can take quite some time with HiPE enabled - and it already takes a really long time if there are a lot of messages in the queue, so if you add HiPE on top of that, it takes much longer.

Okay, thank you very much, Lovisa. Thank you.
[Applause]