From ab0f6574fd19a8f84a8fcc7cb265c6166f3ef31f Mon Sep 17 00:00:00 2001
From: Flatnotes
Date: Sun, 5 May 2024 20:31:50 +0000
Subject: [PATCH] Autocommit action=MODIFY on file=RabbitMQ recent improvments.md detected

---
 1. Follow the Rabbitmq.md      | 101 ---------------------------------
 RabbitMQ recent improvments.md |  78 +++++++++++++++++++++++++
 2 files changed, 78 insertions(+), 101 deletions(-)
 delete mode 100644 1. Follow the Rabbitmq.md
 create mode 100644 RabbitMQ recent improvments.md

diff --git a/RabbitMQ recent improvments.md b/RabbitMQ recent improvments.md
new file mode 100644
index 0000000..117d2cc
--- /dev/null
+++ b/RabbitMQ recent improvments.md
@@ -0,0 +1,78 @@
+RabbitMQ is a key component of an OpenStack deployment.
+Both nova and neutron rely heavily on it for internal communication (between the agents running on the computes and the APIs running on the control plane).
+RabbitMQ clustering is a must-have to let operators manage the lifecycle of RabbitMQ. This is also true when RabbitMQ runs in a Kubernetes environment.
+OpenStack components consume RabbitMQ through oslo.messaging.
+
+Some recent improvements have been made in oslo.messaging to allow better scaling and management of RabbitMQ queues.
+
+**Here is a list of what we did on the OVH side to achieve better stability at large scale.**
+
+* Better eventlet / green thread management
+  The AMQP protocol relies on "heartbeats" to keep idle connections open.
+  Two patches were made in oslo.messaging to send heartbeats correctly:
+  the first patch sends heartbeats more often, to respect the protocol definition;
+  the second patch uses a native thread instead of a green thread to send heartbeats.
+  Green threads can be paused by eventlet under some circumstances, leading to connections being dropped by RabbitMQ because of missed heartbeats.
+  While dropping and creating a new connection is not a big deal on a small deployment, it leads to message loss and a lot of TCP churn at large scale.
+
+***Both patches are merged upstream and available by default.***
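+
+To make the second patch concrete, here is a minimal sketch of the underlying idea, assuming a hypothetical `connection.send_heartbeat()` client call (this is not the actual oslo.messaging code): eventlet can hand back the unpatched `threading` module, and a thread created from it is a real OS thread that the green scheduler cannot pause.
+
+```python
+# Minimal sketch of the native-thread heartbeat idea; `connection` and
+# `send_heartbeat()` are illustrative placeholders, not the real
+# oslo.messaging internals.
+import time
+
+import eventlet
+
+# eventlet.patcher.original() returns the unpatched stdlib module, so
+# threads created from it are native OS threads the hub cannot pause.
+native_threading = eventlet.patcher.original("threading")
+
+
+def heartbeat_loop(connection, heartbeat_timeout, rate=2.0):
+    """Send `rate` heartbeats per negotiated timeout window.
+
+    Sending more than one heartbeat per window is the "send heartbeats
+    more often" part: a single late frame no longer kills the connection.
+    """
+    while True:
+        connection.send_heartbeat()  # placeholder for the client call
+        time.sleep(heartbeat_timeout / rate)
+
+
+def start_heartbeat(connection, heartbeat_timeout=60):
+    # daemon=True so the heartbeat thread never blocks process exit.
+    thread = native_threading.Thread(
+        target=heartbeat_loop,
+        args=(connection, heartbeat_timeout),
+        daemon=True,
+    )
+    thread.start()
+    return thread
+```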
+
+* Replace classic HA with quorum
+  RabbitMQ is moving away from classic HA queues and replacing them with quorum queues (based on the Raft algorithm).
+  This is a huge improvement on the RabbitMQ side: it allows better scalability as well as data redundancy.
+  Quorum queues were only partially implemented in oslo.messaging.
+
+OVH wrote a patch to finish this implementation (for 'transient' queues).
+
+**Using quorum queues is not yet the default, and we would like to enable it by default.**
+
+* Consistent queue naming
+  oslo.messaging relied on random queue names.
+  While this does not look like a problem on small deployments, it has two bad side effects:
+* it is harder to figure out which service created a specific queue;
+* as soon as you restart your services, new random queues are created, leaving a lot of orphaned queues in RabbitMQ.
+
+These side effects are highly visible at large scale, and even more visible when using quorum queues.
+
+**We wrote a patch for oslo.messaging to stop using random names.**
+
+This is now merged upstream, but disabled by default.
+We would like to enable it by default in the future.
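+
+Both of these behaviours are toggled through `[oslo_messaging_rabbit]` options. A sketch follows; the option names should be checked against the oslo.messaging release you deploy:
+
+```ini
+[oslo_messaging_rabbit]
+# Declare queues as quorum queues (Raft-replicated) instead of
+# classic HA queues.
+rabbit_quorum_queue = true
+
+# Build deterministic queue names (per host / process) instead of
+# random suffixes, so restarts reuse queues instead of orphaning them.
+use_queue_manager = true
+```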
+
+* Reduce the number of queues
+  Both neutron and nova rely heavily on RabbitMQ communication.
+  While nova is the one sending the most messages (5x more than neutron), neutron is the one creating the most queues (10x more than nova).
+  RabbitMQ is a message broker, not a queue broker.
+  Neutron creates a lot of queues without even using them (neutron instantiates oslo.messaging for one queue, but oslo.messaging creates multiple queues for multiple purposes, even if neutron does not need them).
+  With a high number of queues, RabbitMQ does not work correctly (timeouts, CPU usage, network usage, etc.).
+
+OVH wrote some patches to reduce the number of queues created by neutron, touching both the oslo.messaging and neutron code (we divided the number of neutron queues by 5).
+
+**We would like to push this upstream.**
+
+* Replace classic fanouts with streams
+  Both neutron and nova rely on fanout queues to send messages to all computes.
+  Neutron mostly uses them to trigger a security group update or any other update on an object (populating the remote cache).
+
+When classic queues were used for this, messages were replicated in all the queues of all the computes.
+If you had a region with 2k computes, you would send 2k identical messages to 2k queues (1 message per queue). This is not efficient at all.
+
+**OVH wrote a patch to rely on "stream" queues to replace classic fanouts.**
+With stream queues, all computes listen to the same queue, so only 1 message is sent to 1 queue and it is received by the 2k computes.
+This also reduces the number of queues on RabbitMQ.
+
+Those patches are merged upstream but disabled by default.
+
+**We would like to enable this by default.**
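+
+As with the previous items, this is switched on through configuration; again a sketch, to be verified against your oslo.messaging release:
+
+```ini
+[oslo_messaging_rabbit]
+# Use one RabbitMQ stream per fanout exchange instead of one classic
+# fanout queue per consumer: 1 message in 1 queue, read by all computes.
+rabbit_stream_fanout = true
+```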
+
+* Get rid of 'transient' queues
+  oslo.messaging distinguishes 'transient' queues from the other queues, but this distinction no longer makes sense.
+  Neutron and nova expect all queues to be fully replicated and highly available.
+  There is no transient concept in the nova / neutron code.
+  The concept leads to bad practices when managing a RabbitMQ cluster, e.g. not replicating the transient queues, which is bad for both nova and neutron.
+
+OVH stopped distinguishing transient queues and manages all queues in a highly available fashion (using quorum queues).
+This allows us to stop one RabbitMQ server of the cluster without any impact on the service.
+
+What we would like is to patch oslo.messaging in the future to stop considering some queues as transient.
+This would simplify the code a lot.
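+
+Recent oslo.messaging releases already expose an option going in this direction; one more sketch, with the option name to be verified for your release:
+
+```ini
+[oslo_messaging_rabbit]
+# Also declare the 'transient' queues (replies, fanouts) as quorum
+# queues, so they survive the loss of a RabbitMQ node.
+rabbit_transient_quorum_queue = true
+```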