> Overview of the infra size we operate - Intro
> What kind of issues we faced with rabbit
> Is it a RabbitMQ setup issue or an OpenStack issue ?
* Issues with rabbit ?
  * flap when rolling out agent / deploying new agent version
    * even crash on big regions
  * network flap / rabbit partition
  * pause-minority helped crash the cluster
  * reset cluster was ... the solution
> Which methods did we use to troubleshoot those issues
> Observability, tools
* What's going on with rabbit ?
  * reproduce workload with rabbit perftest
  * oslo.metrics
  * rabbitmq exporter / grafana dashboards
  * smokeping between nodes
  * rabbitspy
* What we learned ?
  * rabbitmq does not cope well with heavy queue/connection churn
  * identified issues were mostly related to neutron
    * rabbit ddos
      * too many queue declares
      * too much TCP connection churn
    * fanout mechanism: 1 message published, duplicated to N queues
  * Nova rpc usage is clearly != neutron
> Before going further, let's take some time to understand how oslo.messaging works
> How RPC is implemented in OpenStack
> [[ oslo.messaging - How it works with rabbit]]
* Under the hood ?
  * pub/sub mechanism
  * subscriber: RPC server topic=name
    * setup class endpoints
    * create queues / setup consumer thread
  * publish with rpc provided methods
    * call - reply (topic / transient for reply)
    * cast (topic queue)
    * cast / fanout=true (fanout queue)
  * notifications for external use: kafka
> What we did to put rabbits back to their holes
* Journey to get a stable infra.
  * Infra improvements
    * split rabbit-neutron / rabbit-\*
    * scale problematic clusters to 5 nodes
    * upgrade to RabbitMQ 3.10+
      * quorum queue recommended
    * put back partition strategy to pause-minority
  * oslo.messaging improvements
    * queue fixed naming to avoid queue churn
    * heartbeat in pthread fix
    * move from HA queues to quorum queues
    * fix to auto-delete broken quorum queues
    * replace 'fanout' queues by stream queues
    * greatly reduce the number of queues
    * patch to avoid TCP reconnection when a queue is deleted (kombu/oslo)
    * reduce the queues declared by an RPC server (from 3 by default to only 1)
    * use the same connection for multiple topics
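
To illustrate the "queue fixed naming to avoid queue churn" item, here is a minimal sketch of why random queue names cause churn across agent restarts while fixed names do not. It is a toy model and not oslo.messaging code: `FakeBroker`, `random_reply_queue`, `fixed_reply_queue` and `simulate_restarts` are hypothetical names, and the naming scheme (hostname plus process name) is an assumption made for the example.

```python
import uuid


class FakeBroker:
    """Toy stand-in for a broker: it only remembers declared queue names."""

    def __init__(self):
        self.queues = set()
        self.declare_calls = 0

    def declare(self, name):
        # Declaring an existing name is idempotent, as it is for a real queue.
        self.declare_calls += 1
        self.queues.add(name)


def random_reply_queue():
    # A fresh name on every start: each restart leaves a new queue behind.
    return "reply_" + uuid.uuid4().hex


def fixed_reply_queue(hostname, process_name):
    # Hypothetical deterministic scheme: same host and process, same name.
    return "reply_{}_{}".format(hostname, process_name)


def simulate_restarts(broker, name_factory, restarts):
    """Declare one queue per restart and return how many distinct queues exist."""
    for _ in range(restarts):
        broker.declare(name_factory())
    return len(broker.queues)


if __name__ == "__main__":
    random_count = simulate_restarts(FakeBroker(), random_reply_queue, 5)
    fixed_count = simulate_restarts(
        FakeBroker(), lambda: fixed_reply_queue("compute-1", "neutron-agent"), 5
    )
    print("random names:", random_count, "queues")
    print("fixed names:", fixed_count, "queue")
```

In this model, the number of distinct queues grows with every restart when names are random, while a fixed name keeps redeclaring the same queue. On a real cluster the old randomly named queues would also have to expire or be deleted, which adds to the churn.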