From 1014610a8b7b684bee230218c9350adb0d473d41 Mon Sep 17 00:00:00 2001
From: Flatnotes
Date: Sun, 5 May 2024 18:07:21 +0000
Subject: [PATCH] Autocommit action=MODIFY on file=1. Follow the Rabbitmq.md detected

RabbitMQ recent improvements

RabbitMQ is a key component of OpenStack deployments.
Both nova and neutron rely heavily on it for internal communication (between agents running on computes and the APIs running on the control plane).
RabbitMQ clustering is a must-have so that operators can manage the lifecycle of RabbitMQ. This is also true when RabbitMQ runs in a Kubernetes environment.
OpenStack components consume RabbitMQ through oslo.messaging.

Some recent improvements have been made to oslo.messaging to allow better scaling and management of RabbitMQ queues.

Here is a list of what we did on the OVH side to achieve better stability at large scale.


- Better eventlet / green thread management

The AMQP protocol relies on "heartbeats" to keep idle connections open.
Two patches were made in oslo.messaging to send heartbeats correctly:
the first sends heartbeats more often, to respect the protocol definition;
the second uses native threads instead of green threads to send heartbeats.
Green threads could be paused by eventlet under some circumstances, leading to connections being dropped by RabbitMQ because of missed heartbeats.
While dropping and re-creating a connection is not a big deal on a small deployment, at large scale it leads to message loss and a lot of TCP churn.

Both patches are merged upstream and available by default.
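The heartbeat behavior described above is driven by oslo.messaging configuration. A minimal sketch of the relevant options, assuming a recent oslo.messaging release (check the option defaults for your version):

```ini
[oslo_messaging_rabbit]
# Send AMQP heartbeats from a native thread so that eventlet cannot
# pause them and cause RabbitMQ to drop the connection.
heartbeat_in_pthread = true
# Heartbeat timeout negotiated with RabbitMQ, in seconds.
heartbeat_timeout_threshold = 60
# How many heartbeats to send per timeout window; a higher rate sends
# heartbeats more often (the value here is illustrative).
heartbeat_rate = 2
```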
- Replace classic HA queues with quorum queues

RabbitMQ is moving away from classic HA queues and replacing them with quorum queues (based on the Raft consensus algorithm).
This is a huge improvement on the RabbitMQ side: it allows better scalability as well as data redundancy.
Quorum queues were only partially implemented in oslo.messaging.

OVH wrote a patch to finish this implementation (for 'transient' queues).

Using quorum queues is not yet the default, and we would like to enable it by default.


- Consistent queue naming

oslo.messaging used to rely on random queue names.
While this does not seem to be a problem on small deployments, it has two bad side effects:
- it is harder to figure out which service created a specific queue;
- as soon as you restart your services, new random queues are created, leaving a lot of orphaned queues in RabbitMQ.

These side effects are highly visible at large scale, and even more so when using quorum queues.

We wrote a patch for oslo.messaging to stop using random names.

This is now merged upstream, but disabled by default.
We would like to enable it by default in the future.


- Reduce the number of queues

Both neutron and nova rely heavily on RabbitMQ communication.
While nova sends the most messages (5x more than neutron), neutron creates the most queues (10x more than nova).
RabbitMQ is a message broker, not a queue broker.
Neutron creates a lot of queues without even using them: neutron instantiates oslo.messaging for one queue, but oslo.messaging creates multiple queues for multiple purposes, even if neutron does not need them.
With a high number of queues, RabbitMQ does not work correctly (timeouts, CPU usage, network usage, etc.).

OVH wrote patches to reduce the number of queues created by neutron, touching both oslo.messaging and neutron code (we divided neutron's number of queues by 5).

We would like to push this upstream.
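The quorum-queue and consistent-naming behaviors above are opt-in through oslo.messaging configuration. A minimal sketch, with the caveat that these option names come from recent oslo.messaging releases and may not exist in older ones:

```ini
[oslo_messaging_rabbit]
# Create queues as quorum queues instead of classic HA queues
# (disabled by default upstream).
rabbit_quorum_queue = true
# Derive queue names from the host and process instead of random
# strings (assumed option name; disabled by default upstream).
use_queue_manager = true
```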
- Replace classic fanouts with streams

Both neutron and nova rely on fanout queues to send messages to all computes.
Neutron mostly uses them to trigger a security group update, or any other update on an object (populating the remote cache).

When classic queues were used for this, messages were replicated into all the queues of all the computes.
In a region with 2k computes, you would send 2k identical messages into 2k queues (1 message per queue). This is not efficient at all.

OVH wrote a patch to rely on "stream" queues to replace classic fanouts.
With stream queues, all computes listen to the same queue, so only 1 message is sent to 1 queue and is received by the 2k computes.
This also reduces the number of queues in RabbitMQ.

Those patches are merged upstream but disabled by default.

We would like to enable them by default.


- Get rid of 'transient' queues

oslo.messaging distinguishes 'transient' queues from other queues, but this distinction no longer makes sense.
Neutron and nova expect all queues to be fully replicated and highly available.
There is no transient concept in nova / neutron code.
This concept leads to bad practices when managing a RabbitMQ cluster, e.g. not replicating the transient queues, which is bad for both nova and neutron.

OVH stopped distinguishing transient queues and manages all queues in a highly available fashion (using quorum queues).
This allows us to stop a RabbitMQ server in the cluster without any impact on the service.

We would like to patch oslo.messaging in the future to stop considering some queues as transient.
This would simplify the code a lot.
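Both behaviors above can be sketched as oslo.messaging configuration; the option names below are assumed from recent oslo.messaging releases and should be verified against the version you run:

```ini
[oslo_messaging_rabbit]
# Use a RabbitMQ stream for fanout traffic: one stream replaces the
# per-compute fanout queues (disabled by default upstream).
rabbit_stream_fanout = true
# Stream consumers require a non-zero prefetch; the value here is
# illustrative.
rabbit_qos_prefetch_count = 32
# Also create 'transient' queues as quorum queues so that every queue
# is replicated (assumed option name).
rabbit_transient_quorum_queue = true
```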