What we've learned from running thousands of production RabbitMQ clusters - Lovisa Johansson
By RabbitMQ Summit
Summary
Topics Covered
- Minimize Connection and Channel Count
- Separate Publisher and Consumer Connections
- Keep Queues Short to Avoid Overload
- Tune Prefetch for Even Consumer Load
- HiPE Boosts Throughput at Startup Cost
Full Transcript
I fell down the rabbit hole in 2014, when I started to work at CloudAMQP, where we provide RabbitMQ as a service. My goal when I started at CloudAMQP was to make it easier for everyone to get started with RabbitMQ. I wanted to bring simple use cases to the public and point out all the benefits of using RabbitMQ, so I started to write technical documentation, update code examples, and write blog posts, more blog posts, and tutorials. I answer questions on Stack Overflow, and I also created two ebooks. So today, when I'm at a bar having a beer or when I talk to unfamiliar people, I ironically call myself an influencer - and the life of a blogger and influencer of this kind is almost as glamorous as you might think: you get spoiled with all kinds of luxury and you get to hang out with important people. And I don't know if you can see it, but that's a dog trying to disturb me, because I work from home a lot. During my time at CloudAMQP I have replied to more than 3,000 emails about RabbitMQ, and I have, as Anya said, been part of the urgent support team, where we respond to urgent support issues - when a client needs attention directly, when a server is running out of memory, or when RabbitMQ is under heavy load. With all this writing, many things were documented. I met up with different customers to see how they are using RabbitMQ, and I wrote down lots of common use cases. We collected patterns and errors that we see in setups: configuration mistakes and all sorts of common mistakes -
things that can go wrong and things that work out well. So what kind of issues are we dealing with? First of all, we have client-side problems, where users like you and me, or client libraries, are using RabbitMQ in a bad way, and situations where things are just not done in an optimal way. And then of course we have the server side: servers running old versions, misconfigured servers, or setups that are not configured for the selected use case. So today I will spread the knowledge and talk about what we have learned from running thousands of RabbitMQ nodes. But before we go into all the things I have already written down all over the internet:
my name is Lovisa and I'm from Sweden. I work at 84codes, which is the provider of CloudAMQP, as marketing manager, support engineer, and lots in between. I have a growing family of lovely colleagues, and many of them are with me here today; they are always happy to talk, so come by and talk to us at our booth downstairs. 84codes is also the provider of three other services: CloudKarafka, which is Apache Kafka as a service; ElephantSQL, which is PostgreSQL as a service; and the hosted message broker for IoT, which is named CloudMQTT. We work remotely from all over the world, and we also have customers from all over the world, but we have our headquarters in Stockholm. We are today the largest provider of managed RabbitMQ servers, with tens of thousands of running instances in seven clouds and 75 regions. I will now give some recommendations, where "some" equals 16, and I will also give a summary of all these recommendations at the end of this presentation. I know I will repeat lots of things that have already been said today, but I did not really have
time to rewrite my slides. Recommendation number one: try to keep the connection and channel count low. Each connection uses about 100 kB of RAM - and even more if TLS is used - so thousands of connections can be a heavy burden on a RabbitMQ server, especially if you're running on a small instance. And believe it or not, connection and channel leaks are among the most common errors that we see, but luckily they are mostly unintentional. Number two: make sure you don't open and close connections or channels repeatedly. Doing that gives you higher latency, as more TCP packets have to be sent and received, and the handshake process of an AMQP connection is, as mentioned before, quite involved: it requires at least seven TCP packets, and again even more if TLS is used. RabbitMQ is optimized for long-lived connections, so keep connections open and reuse them if you are able to. Channels can be opened and closed more frequently, but even channels should be long-lived if possible; it is best not to open a new channel every time you publish. Each process should ideally create only one TCP connection and use multiple channels within that connection for its different threads. We deal with servers that are under heavy load due to the opening and closing of connections almost every week, because some clients can't keep a long-lived connection to the server.
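As a back-of-the-envelope sketch of why this matters, here is a toy calculation using the talk's rough figure of ~100 kB of broker RAM per connection (the process and thread counts are made-up examples, and channels are ignored because they are far cheaper than connections):

```python
# Toy estimate of broker RAM held by AMQP connections alone.
# ~100 kB per connection is the rough figure from the talk; channels
# are much cheaper, so only connections are counted here.
CONNECTION_OVERHEAD_KB = 100

def connection_ram_kb(processes: int, threads_per_process: int,
                      connection_per_thread: bool) -> int:
    """RAM in kB consumed on the broker by a fleet of worker processes."""
    connections = processes * (threads_per_process if connection_per_thread else 1)
    return connections * CONNECTION_OVERHEAD_KB

# 50 worker processes with 10 threads each:
# one connection per thread -> 500 connections; one per process -> 50.
naive = connection_ram_kb(50, 10, connection_per_thread=True)
shared = connection_ram_kb(50, 10, connection_per_thread=False)
print(naive, shared)  # 50000 5000 (kB) - a tenfold difference
```

The same arithmetic is why the one-connection-per-process, one-channel-per-thread pattern is the recommended shape for client applications.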
And this has, as I said before, an impact on latency. One way to avoid connection churn is to use a proxy that pools connections and channels for reuse, and we have developed an AMQP proxy for this. Our benchmarks show that the proxy increases publishing speed by a magnitude or more, and there is a link to the GitHub repo in the slides. We develop in many different languages, and this proxy is written in Crystal, which we are also a really proud sponsor of. Number three: always separate connections for publishers and consumers.
First of all, imagine what happens if you use the same connection for the publisher and the consumer when the connection is in flow control. A flow-controlled connection is a connection that is being blocked and unblocked several times per second in order to keep the rate of messages at a speed the rest of the server can handle. A publisher and consumer on the same connection might worsen this flow control, since you might not be able to consume messages while the connection is blocked. Secondly, RabbitMQ can apply back pressure on the TCP connection when the publisher is sending too many messages to the server, and if you consume on the same TCP connection, the server might not receive the message acknowledgements from the client. So consumer performance is affected too, and with lower consumer speed the server might be overwhelmed after a while. If you know something about Sweden - apart from Swedish fika - it might be that we
love queueing. What we don't like is large queues, or when people somehow try to squeeze in ahead of you in the line. So recommendation number four comes straight from my heart: whatever your use case, try to keep your queues as short as possible. A message published to an empty queue goes straight out to the consumer as soon as the queue receives it (and a persistent message is of course also written to disk). It's recommended to have fewer than 10,000 messages in a queue. Many messages in a queue put a heavy load on RAM usage, and in order to free up RAM, RabbitMQ starts flushing - paging out - messages to disk. This paging process usually takes time and blocks the queue from processing messages when there are many messages to page out. Another bad thing about large queues is that it's time-consuming to restart a cluster with many messages, since the index has to be rebuilt, and it's also time-consuming to sync messages between nodes in the cluster. Large queues are also a very, very common error among our customers: a queue just piles up due to missing consumers, or because clients are publishing messages faster than the consumers are able to handle them, and eventually the server is overloaded and killed. When this happens we usually add more power to those machines, but it still takes time to restart the cluster, again because of the rebuilding of the index, et cetera. For applications that often get hit by spikes of messages, and where throughput is more important than anything else, it's sometimes recommended to set a max-length on the queue, because this keeps the queue short by discarding messages from the head of the queue.
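That head-drop behavior can be mimicked with a bounded deque - a toy model, not RabbitMQ itself, but `collections.deque` with `maxlen` discards from the opposite end on append, which matches the default drop-head overflow of a max-length queue:

```python
from collections import deque

# Toy model of a queue with a max-length policy: when the queue is
# full, the oldest message (the head) is discarded to make room for
# the newest one.
q = deque(maxlen=3)
for msg in ["m1", "m2", "m3", "m4", "m5"]:
    q.append(msg)

print(list(q))  # ['m3', 'm4', 'm5'] - m1 and m2 were dropped from the head
```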
Recommendation number five: lazy queues, a feature that arrived in RabbitMQ 3.6. A lazy queue writes messages immediately to disk, spreading the work out over time instead of taking the risk of a performance hit somewhere down the road. Messages are only loaded into memory when they are needed, and thereby RAM usage is minimized, but throughput will be lower with lazy queues. So lazy queues give you more predictable, smooth performance without sudden drops, but at the cost of a little overhead. If you're sending many messages at once - for example if you're processing batch jobs - or if you think your consumers will not always keep up with the speed of the publishers, then we recommend enabling lazy queues. You can ignore lazy queues if you require high performance, or if you know that your queues will always stay short due to a max-length policy or something like that.
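Lazy mode is usually enabled via a policy rather than per-queue arguments; a sketch with `rabbitmqctl` (the policy name and the queue-name pattern here are made-up examples):

```shell
# Switch all queues whose names start with "bulk." to lazy mode.
# Policy name "lazy-bulk" and the "^bulk\." pattern are illustrative.
rabbitmqctl set_policy lazy-bulk "^bulk\." '{"queue-mode":"lazy"}' --apply-to queues
```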
Number six: the RabbitMQ management interface collects and calculates metrics for every queue, connection, and channel in the cluster, and setting the management plugin's rates_mode to detailed can have a serious performance impact. It should not be used in production if you have thousands upon thousands of active queues or consumers. Recommendation number seven:
queues are single-threaded in RabbitMQ, and one queue can handle up to around 50,000 messages per second. You will get better performance if you split your queues over different cores and nodes, and route messages between multiple queues. Queues are bound to the node where they are first declared, so all messages routed to a specific queue will end up on the node where that queue resides. You can of course manually split queues evenly between nodes, but the downside is that you need to remember where each queue is located. We recommend two plugins that can help you if you have multiple nodes, or a single-node cluster with multiple cores: the consistent hash exchange plugin and rabbitmq-sharding. The consistent hash exchange plugin has been mentioned a lot today; it allows you to use an exchange to load-balance messages between queues, so messages sent to the exchange are consistently and equally distributed across the many bound queues. It can quickly become hard to do this manually without adding too much information about the number of queues and their bindings into the publishers - and note that it's important to consume from all queues bound to the exchange when using this plugin. The sharding plugin, rabbitmq-sharding, does the partitioning of queues automatically for you: once you have defined an exchange as sharded, the supporting queues are automatically created on every cluster node and messages are sharded across them. Sharding shows one queue to the consumer, but it could be many queues running behind it in the background.
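The load-balancing idea can be sketched with a toy router. This is a simplification - the plugin actually uses a consistent-hash ring with binding weights, while this sketch just hashes the routing key modulo the queue count, and the queue names are made up - but it shows why equal-weight queues end up with a roughly even share:

```python
import hashlib

# Toy router: spread messages across queues by hashing the routing key.
QUEUES = ["shard.0", "shard.1", "shard.2", "shard.3"]

def route(routing_key: str) -> str:
    digest = hashlib.sha256(routing_key.encode()).hexdigest()
    return QUEUES[int(digest, 16) % len(QUEUES)]

counts = {q: 0 for q in QUEUES}
for i in range(10_000):
    counts[route(f"order-{i}")] += 1
print(counts)  # each shard receives roughly a quarter of the messages
```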
Then we have priorities. Queues can have zero or more priority levels, and behind the scenes a backing queue is created for each level, so each priority level uses an internal queue on the Erlang virtual machine, which takes up some resources. Using 255 or even thousands of priorities means resource usage similar to having close to that many queues; in most use cases it's sufficient to have no more than five priority levels. This is fixed in RabbitMQ 3.7.6: the max priority cap for queues is now enforced at 255, and applications that rely on a higher number of priorities will break - such applications must be updated to use no more than 255 priorities. Two or maybe three weeks ago we had a case where every restart took a really long time and memory usage just exploded, as you can see there - and this was despite few queues and few messages, which is otherwise the common pattern when a broker takes a long time to restart. This time it was due to many priority levels.
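The delivery behavior behind priorities can be modeled with a toy min-heap (not RabbitMQ internals; the message bodies are made up): higher-priority messages go first, FIFO within a level - and remember that in RabbitMQ each level costs an internal backing queue, which is why a handful of levels is plenty:

```python
import heapq

# Toy model of priority delivery order: higher priority first,
# FIFO within a priority level.
messages = [(5, "urgent-1"), (1, "bulk-1"), (3, "normal-1"),
            (5, "urgent-2"), (1, "bulk-2")]

heap = []
for seq, (priority, body) in enumerate(messages):
    # negate priority because heapq is a min-heap; seq preserves FIFO
    heapq.heappush(heap, (-priority, seq, body))

order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
print(order)  # ['urgent-1', 'urgent-2', 'normal-1', 'bulk-1', 'bulk-2']
```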
Everyone needs to be prepared for broker restarts, broken hardware, or server crashes, and to ensure that messages and broker definitions survive restarts, we need to ensure that they are on disk. Messages, exchanges, and queues that are not durable and persistent are lost during a broker restart, so make sure that your queues are declared as durable and that messages are sent with delivery mode persistent. But remember that persistent messages are heavier, as they have to be written to disk; for high performance it's better to use transient messages and temporary or
non-durable queues. Then we have the prefetch value, which is used to specify how many messages are sent to the consumer and cached by the RabbitMQ client library - how many messages the client can receive before acknowledging a message. It's used to get as much out of your consumers as possible. RabbitMQ's default prefetch setting gives clients an unlimited buffer, meaning that RabbitMQ by default sends as many messages as it can to any consumer that looks ready to accept them, and the messages are cached by the client library in the consumer until they have been processed. So not setting a prefetch can lead to clients running out of memory, and it makes it impossible to scale out with more consumers. In RabbitMQ 3.7 we got a new option to adjust the default prefetch value, and this value will by default be set to 1,000 on all new CloudAMQP servers once we roll out 3.7 or higher - it doesn't really exist yet, but soon. A too-small prefetch count may hurt performance, since the client spends most of its time waiting for permission to receive more messages. The figure here illustrates long idling time: in the example we have a prefetch setting of 1, which means that RabbitMQ won't send out the next message until the round trip completes - deliver, process, acknowledge. In this image we have a total round-trip time of 125 milliseconds with a processing time of only 5 milliseconds, so a too-low value will keep the consumer idling a lot, since it needs to wait for messages to arrive. A large prefetch count, on the other hand, could deliver lots of messages to one single consumer and keep that consumer busy while other consumers are held in an idling state - in this image we have one client that has a lot to do and one that is just waiting with nothing to do. So: if you have a single consumer, or a few consumers, that process messages quickly, we recommend prefetching many messages at once and trying to keep your clients as busy as possible. If you have about the same processing time all the time, and the network behavior remains the same, you can take the total round-trip time divided by the processing time on the client for each message to get an estimated prefetch value. If you have many consumers and a short processing time, we recommend a lower prefetch value than for a single or a few consumers. And finally, if you have many consumers and/or a long processing time, we recommend setting the prefetch count to 1, so that messages are evenly distributed among all your workers. One thing you should remember, though: if your client auto-acks messages, the prefetch value has no effect.
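That rule of thumb can be written down directly (a sketch; the 125 ms / 5 ms numbers are the example from the talk):

```python
# Rule of thumb: estimated prefetch for a single fast consumer is the
# total round-trip time divided by the per-message processing time.
def estimate_prefetch(round_trip_ms: float, processing_ms: float) -> int:
    return max(1, round(round_trip_ms / processing_ms))

# The talk's figure: 125 ms round trip, 5 ms processing.
print(estimate_prefetch(125, 5))  # 25 - enough to keep the consumer busy

# When processing dominates the round trip, fall back to a prefetch of 1.
print(estimate_prefetch(50, 100))  # 1
```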
HiPE increases server throughput at the cost of increased startup time. When you enable HiPE, RabbitMQ is compiled at startup, and throughput increases by 20 to 80 percent according to benchmark tests. The drawback of HiPE is that startup time increases quite a lot, to one to three minutes, and it is therefore not recommended if you require high availability, due to this long startup time. We don't consider HiPE experimental any longer: 6 percent of our clusters have HiPE enabled, and we haven't seen issues with it for a really long time. Acknowledgements let the server
and client know whether a message has to be retransmitted. The client can either ack a message when it receives it, or when it has completely processed it, so pay attention to where in your consumer logic you are acknowledging messages. A consuming application that receives essential messages should not acknowledge them until it has finished whatever it needs to do with them, so that unprocessed messages don't go missing in case of worker crashes, exceptions, and so on. An acknowledgement has a performance impact, so for the fastest possible throughput, manual acks can be disabled. Publish confirms are the same thing but for publishing: the server acks when it has received a message from the publisher. Publish confirms also have a performance impact; however, keep in mind that they are required if the publisher needs messages to be processed at least once.
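The point about ack placement can be illustrated with a toy simulation - a made-up model of broker state, not a client library. Acking on receipt loses the message if the worker crashes mid-job; acking after processing leaves it un-acked and therefore redeliverable:

```python
# Toy model of ack placement (broker state as a plain dict).
def consume(ack_before_processing: bool, crash: bool):
    unacked = {"tag1": "job"}  # broker side: delivered, awaiting ack
    done = []
    if ack_before_processing:
        unacked.pop("tag1")    # ack immediately on receipt
    if crash:
        # worker dies: whatever is still un-acked can be redelivered
        return done, list(unacked.values())
    done.append("job")         # ... actual processing happens here ...
    unacked.pop("tag1", None)  # ack only after successful processing
    return done, list(unacked.values())

print(consume(ack_before_processing=True, crash=True))   # ([], []) - job lost
print(consume(ack_before_processing=False, crash=True))  # ([], ['job']) - redelivered
```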
Thanks to the RabbitMQ team, great improvements are made all the time, which is really great. In 3.7 we got the configurable default prefetch, as I just mentioned, and this will probably completely remove the cases where a consumer has been killed by too-large message deliveries due to an unlimited prefetch value. Also in 3.7, per-vhost message stores are now available, and this helps us at CloudAMQP a lot, because we have subscription plans - shared plans - that give a customer a single vhost on a multi-tenant server. This means that our shared plans can be even more stable, and even if you have multiple vhosts on a dedicated plan, it will also be more stable. RabbitMQ 3.6 introduced the lazy queues feature, which gave many of our
customers a more predictable and stable cluster, and we found the lazy queues feature so good that all our new clusters with RabbitMQ version 3.6 or later have lazy queues enabled by default. Many of our customers with issues are running old versions, or undocumented, unstable versions. 3.6 had many memory problems up to version 3.6.14, and 3.5.7 was good but lacks some nice features, like lazy queues - and we still have lots of servers running 3.5. This image shows the RabbitMQ version distribution among CloudAMQP customers, and it's nice to see that so many have upgraded to 3.7, because it's always something we try to push: we want our customers to run new, stable versions, and we always test new versions before we make them available in CloudAMQP. The version that is pre-selected in the drop-down menu, on the page where you select which version you want to run - that is the version we recommend at the moment. So this recommendation is: stay up to date with what is happening in RabbitMQ, use a stable version, and also a stable Erlang and a stable client
library version. Some plugins might be super nice to have, but on the other hand they might consume a lot of resources, and therefore they are not recommended on production servers - so make sure to disable plugins that you are not using. An example of a plugin that we use a lot, but that we disable every time we are finished with it, is the top plugin, which we use when we are troubleshooting RabbitMQ servers for our
customers. Number 15: even unused queues take up some resources - queue index, management statistics, and so on - and leaving temporary queues behind can eventually cause RabbitMQ to run out of memory. So make sure that you don't leave unused queues behind, and set temporary queues to auto-delete, make them exclusive, or give them an expiry (TTL). Many of our customers create custom vhosts and then forget to add an HA policy to the new vhost, which causes message loss during net splits. We have an HA policy on all our clusters, even single-node clusters, because we use it when customers are upgrading to new versions, when they want to change from a two-node cluster to a three-node cluster, when they want to upgrade RabbitMQ versions, and so on.
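An HA policy can be applied with `rabbitmqctl`; a sketch mirroring every queue to all nodes (the policy name is illustrative, and `ha-mode: all` is the simplest choice, not necessarily the right one for every cluster):

```shell
# Mirror all queues across all nodes in the cluster.
# Policy name "ha-all" and the catch-all ".*" pattern are illustrative.
rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}' --apply-to queues
```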
Here is a summary of it all:

- Keep your queues short
- Use long-lived connections
- Limit your use of priority queues
- Use multiple queues and consumers, and split your queues over different cores
- Use a stable Erlang and RabbitMQ version, and a stable client library version
- Disable plugins you are not using
- Have channels on all your connections, and separate connections for publishers and consumers
- Don't set the management statistics rate mode to detailed in production
- Delete unused queues, and set temporary queues to auto-delete

For those of you who are interested in recommendations for high performance, this is even more important:

- Short queues, and use of max-length if possible
- Do not use lazy queues
- Send transient messages
- Disable manual acks and publish confirms
- Avoid multiple nodes
- Enable RabbitMQ HiPE

And for those who are more interested in high availability:

- Enable lazy queues
- Have two nodes, and don't forget the HA policy
- Use persistent messages and durable queues
- Do not enable HiPE

And the last one: we have
created a diagnostic tool, available from the CloudAMQP control panel, where customers can validate their RabbitMQ setup and get a score for it. It's been used by many customers, and it's nice to have: when we get a support request, we can check it first and then get back to the customer and say, you need to fix this, this, and this - and then the server is usually running much better after that. Here are examples of things that are validated by this diagnostic tool; I think I've talked about many of them, but not all of them, so just come down and talk to us if you want to see it. As we have seen, best-practice recommendations are different for different use cases, and some
applications require high throughput, while other applications publish batch jobs that can be delayed for a while, and other applications just need to have lots of connections. Trade-offs have to be made between performance, guaranteed message delivery, and so on. Our customers are today able to select the number of nodes when they create a cluster: a single node for high performance, and two or three nodes mainly for high availability and/or consistency. We also have lots of other features built into the CloudAMQP control panel, like the option to configure alarms for queue length or for missing consumers, and users can view how many messages there have been in a queue over time, which helps us a lot when we are troubleshooting servers, since statistics for the queues are available all the time. We also show metrics for usage, like CPU, RAM, and disk. We have seen many different use cases, and there are future plans at CloudAMQP to make it even easier for customers to quickly set up a cluster specified for a selected use case, based on best-practice recommendations. This is my final slide, and it would be nice if we could have a list like this in the community - a list of recommendations - because it makes it so much easier for beginners to start using RabbitMQ. So if you have any recommendations of things that we need to add, or if you have different opinions about something, just let me know or reach out to me.
Thanks! Perfect, thank you. Let's get started with some questions - do you want to join me for questions? This is our lead developer, and he's also the one who should take a lot of the credit for the diagnostic tool that we have.
Q: Hi - which public cloud providers do you use to serve RabbitMQ, and how do you scale each cluster up and down? Are you using Docker containers or EC2-style virtual machines, and which metrics do you use to scale cluster nodes up and down?

A: The customer can select the data center - that was the first part, right? They select the data center when they create a cluster, and they can choose between Amazon, Rackspace, IBM Cloud, Alibaba Cloud - all of them. And for the other part of the question, how we bootstrap: we don't use any Docker. We just use the cloud providers' different APIs to spin up instances, and we have all our bootstrapping in Bash scripts - no fancy container stuff.

Q: And which metrics do you use for the scale-up and scale-down of cluster nodes?

A: Basically the same as for bootstrapping: we use the cloud providers' APIs to spin up instances, bootstrap them with our custom scripts, and add them to the RabbitMQ cluster, then remove the smaller old nodes - so rolling: adding new nodes, removing old ones.

Q: Hi, one question: are you really doing everything just with Bash scripts? Nothing like Kubernetes or Docker or BOSH behind this?

A: No - Bash, just Bash.

Q: Okay, thank you - and thank you for the talk, that was awesome.
I wonder if you have a strategy to help customers keep up to date - do you handle part of the upgrades, do you provide tools, or anything else?

A: In our control panel we have a simple button where you press upgrade, and we send out information like: now there's a new RabbitMQ version, and it's good because of this, this, and this. Whenever we can, we do the upgrade without downtime - if it's a patch upgrade we do it node by node - but if it requires downtime, we notify you beforehand.

Q: On the slide about high performance you had a bird, and on the slide
about HA you had something like a mound - what is that mound?

A: It's an anthill, because it's many ants - and the bird is flying high. That's the reason. It's not something I've been thinking about a lot.

Q: Thank you for the talk. Would you say that RabbitMQ 3.7 is
more stable than 3.6.14 and later? Please be honest.

A: Is RabbitMQ 3.7 more stable than 3.6? I would say yes. With the early versions of 3.6 we had a lot of problems; the current 3.6 version and all the 3.7 versions have been working really well, and lazy queues was a really good feature for us.

Any more questions? Yes -
Q: Final question: what's the downside of using HiPE with HA?

A: The downside of using HiPE is mainly that when a node has to come back online after a net split or something, it can take quite some time with HiPE enabled - and it already takes a really long time if there are a lot of messages in the queue, so if you add HiPE on top of that, it takes much longer.

Okay, thank you very much, Lovisa. Thank you.
[Applause]