Overview of the infra size we operate
- Intro
What kind of issues did we face with rabbit? Is it a RabbitMQ setup issue or an OpenStack issue?
- Issues with rabbit?
- flapping when rolling out agents / deploying a new agent version
- even crashes on big regions
- network flap / rabbit partition
- pause-minority helped crash the cluster
- reset cluster was ... the solution
Which methods did we use to troubleshoot those issues? Observability and tools.
- What's going on with rabbit?
- reproduce workload with RabbitMQ PerfTest
- oslo.metrics
- rabbitmq exporter / grafana dashboards
- smokeping between nodes
- rabbitspy
- What did we learn?
- rabbitmq does not cope well with heavy queue/connection churn
- identified issues were mostly related to neutron
- rabbit ddos
- too many queue declarations
- too much TCP connection churn
- fanout mechanism: 1 message published, duplicated to N queues
- Nova RPC usage clearly differs from neutron's
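
The fanout point above (1 message published, duplicated to N queues) can be made concrete with a toy in-memory model. This is only an illustrative sketch, not RabbitMQ or oslo.messaging code: the `FanoutExchange` class, the queue names, and the agent count of 500 are all hypothetical.

```python
from collections import deque


class FanoutExchange:
    """Toy stand-in for a fanout exchange (hypothetical, not a RabbitMQ API)."""

    def __init__(self):
        self.queues = {}  # queue name -> pending messages

    def bind(self, queue_name):
        # Each consumer binds its own queue to the exchange.
        self.queues.setdefault(queue_name, deque())

    def publish(self, message):
        # One publish is copied into every bound queue.
        for pending in self.queues.values():
            pending.append(message)
        return len(self.queues)


exchange = FanoutExchange()
for i in range(500):  # assumed number of agents, one queue each
    exchange.bind(f"agent-{i}")

copies = exchange.publish("port-update")
print(copies)  # number of queue writes caused by a single publish
```

In this model the broker-side cost of a fanout cast grows linearly with the number of bound queues, which is why a region with many agents amplifies every fanout message.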
Before going further, let's take some time to understand how oslo.messaging works: how RPC is implemented in OpenStack, and how oslo.messaging works with rabbit.
- Under the hood?
- pub/sub mechanism
- subscriber: RPC server topic=name
- set up endpoint classes
- create queues / set up consumer thread
- publish with the RPC-provided methods
- call / reply (topic queue, transient queue for the reply)
- cast (topic queue)
- cast / fanout=true (fanout queue)
- notifications for external use: kafka
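
The pub/sub outline above can be sketched as a small pure-Python model of which queues an RPC server consumes from and where call, cast, and fanout cast messages land. This is a simplified illustration under stated assumptions, not the oslo.messaging implementation: the helper names `rpc_server_queues` and `route` are hypothetical, the queue name patterns are assumed to follow what the RabbitMQ driver typically declares, and reply queues are left out.

```python
import uuid


def rpc_server_queues(topic, server):
    """Queues one RPC server consumes from in this simplified model."""
    return [
        topic,                                 # shared by all servers on the topic
        f"{topic}.{server}",                   # addressed to one specific server
        f"{topic}_fanout_{uuid.uuid4().hex}",  # private queue for fanout casts
    ]


def route(topic, declared, server=None, fanout=False):
    """Return the queues a call or cast would land in (assumed routing)."""
    if fanout:
        prefix = f"{topic}_fanout_"
        return [q for q in declared if q.startswith(prefix)]
    if server is not None:
        return [f"{topic}.{server}"]
    return [topic]


declared = []
for host in ("host-a", "host-b"):  # hypothetical server names
    declared += rpc_server_queues("demo-topic", host)

print(route("demo-topic", declared))                    # call / cast
print(route("demo-topic", declared, server="host-a"))   # targeted at one server
print(len(route("demo-topic", declared, fanout=True)))  # fanout cast
```

With two servers the model ends up with five distinct queues: one shared topic queue plus a per-server queue and a per-server fanout queue for each host.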
What we did to put the rabbits back in their holes
- Journey to get a stable infra.
- Infra improvements
- split rabbit-neutron / rabbit-*
- scale problematic clusters to 5 nodes
- upgrade to RabbitMQ 3.10+
- quorum queues recommended
- set partition strategy back to pause-minority
- oslo.messaging improvements
- queue fixed naming to avoid queue churn
- heartbeat in pthread fix
- move from HA queues to quorum queues
- fix to auto-delete broken quorum queues
- replace 'fanout' queues with stream queues
- greatly reduces the number of queues
- patch to avoid tcp reconnection when a queue is deleted (kombu/oslo)
- reduce the queues declared by an RPC server (from 3 by default to only 1)
- use the same connection for multiple topics
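
To illustrate why fixed queue naming helps with queue churn, here is a minimal sketch comparing random and stable queue names across agent restarts. The naming schemes, the helper names, and the `q-agent` / `compute-1` values are hypothetical and chosen for illustration only; they are not the exact names oslo.messaging generates.

```python
import uuid


def random_name(topic, host, process):
    # Assumed "before" behaviour: a fresh random suffix on every start.
    return f"{topic}_fanout_{uuid.uuid4().hex}"


def fixed_name(topic, host, process):
    # Assumed "after" behaviour: the name depends only on stable identifiers.
    return f"{topic}_fanout_{host}:{process}"


def declared_queues(naming, restarts, topic="q-agent", host="compute-1",
                    process="agent"):
    """Distinct queue names one agent declares over several restarts."""
    return {naming(topic, host, process) for _ in range(restarts)}


print(len(declared_queues(random_name, 5)))  # one new queue per restart
print(len(declared_queues(fixed_name, 5)))   # the same queue is reused
```

With random names, every rollout leaves the broker creating new queues and cleaning up old ones for each agent; with a stable name the restarted agent simply reattaches to the queue it already had.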