Issues with rabbit?
- flapping when rolling out agents / deploying a new agent version
- even crashes in big regions
- network flaps / rabbit partitions
- pause-minority helped crash the cluster
- resetting the cluster was ... the solution
-
What's going on with rabbit?
- reproduce workload with rabbit perftest
- oslo.metrics
- rabbitmq exporter / grafana dashboards
- smokeping between nodes
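A dashboard panel for churn usually boils down to a per-second rate computed from two scrapes of a monotonically increasing counter. The sketch below shows that arithmetic only; the counter names and the sample values are made up for illustration and are not taken from any exporter.

```python
def per_second_rate(prev, curr):
    """Rate of a monotonically increasing counter between two scrapes.

    prev and curr are (timestamp_seconds, counter_value) pairs.
    """
    (t0, v0), (t1, v1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # A counter that went down means it was reset (e.g. broker restart):
    # in that case count from zero instead of producing a negative rate.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / elapsed


# Hypothetical scrapes taken 60 s apart during an agent rollout.
before = {"connections_opened": 12_000, "queues_declared": 40_000}
after = {"connections_opened": 15_000, "queues_declared": 49_000}

churn = {
    name: per_second_rate((0, before[name]), (60, after[name]))
    for name in before
}
print(churn)
```

Comparing these rates during a rollout against a quiet period is one way to see whether the broker is being hit by connection or queue-declare bursts.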
-
What did we learn?
- rabbitmq really does not like heavy queue/connection churn
- identified issues were mostly related to neutron
  - rabbit ddos
    - too many queue declares
    - too much tcp connection churn
- Nova RPC usage is clearly != neutron's
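To get a feel for why an agent rollout can look like a DDoS to the broker, here is a back-of-the-envelope sketch. Every number in it is hypothetical (agent counts, topics per agent, connections per agent), and the figure of three queues per topic is an assumption about how an RPC server sets up its listeners, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class AgentProfile:
    name: str
    count: int        # how many agents of this kind restart
    rpc_topics: int   # topics each agent listens on
    connections: int  # TCP connections each agent opens


# Assumption: one shared topic queue, one per-host queue, one fanout queue.
QUEUES_PER_TOPIC = 3


def rollout_load(profiles, restart_window_s):
    """Total and per-second queue declares / new connections for a rollout."""
    declares = sum(p.count * p.rpc_topics * QUEUES_PER_TOPIC for p in profiles)
    connections = sum(p.count * p.connections for p in profiles)
    return {
        "queue_declares": declares,
        "new_connections": connections,
        "declares_per_s": declares / restart_window_s,
        "connections_per_s": connections / restart_window_s,
    }


# Hypothetical region: all agents restarted within 5 minutes.
profiles = [
    AgentProfile("l2-agent", count=2000, rpc_topics=4, connections=3),
    AgentProfile("dhcp-agent", count=50, rpc_topics=2, connections=2),
]
load = rollout_load(profiles, restart_window_s=300)
print(load)
```

Shrinking the restart window or adding topics per agent raises the burst the broker sees.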
-
-
Under the hood? RPC implementation in OpenStack, aka oslo.messaging
- pub/sub
- RPC server: sets up endpoints / queues / listeners
- publish: RPC-provided methods
  - call / reply (topic queue, transient queue for the reply)
  - cast (topic queue)
  - cast with fanout=true (fanout queue)
- notifications: kafka
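The call / cast / fanout patterns listed above can be illustrated with a toy in-memory model. This is not oslo.messaging code: `ToyBroker`, `ToyRpcServer` and `ToyRpcClient` are hypothetical names, and the queue-naming scheme (`<topic>`, `<topic>.<host>`, `<topic>_fanout_<random>`, `reply_<random>`) is an assumption made for the sketch.

```python
import uuid
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: queues just hold messages."""

    def __init__(self):
        self.queues = defaultdict(list)           # queue name -> messages
        self.fanout_bindings = defaultdict(list)  # fanout exchange -> queues

    def declare(self, queue, fanout_exchange=None):
        self.queues[queue]  # touching the defaultdict creates the queue
        if fanout_exchange is not None:
            self.fanout_bindings[fanout_exchange].append(queue)

    def send(self, queue, message):
        self.queues[queue].append(message)

    def broadcast(self, fanout_exchange, message):
        # Every bound queue gets its own copy of the message.
        for queue in self.fanout_bindings[fanout_exchange]:
            self.queues[queue].append(message)


class ToyRpcServer:
    def __init__(self, broker, topic, host):
        self.fanout_queue = f"{topic}_fanout_{uuid.uuid4().hex}"
        broker.declare(topic)              # shared by all servers of the topic
        broker.declare(f"{topic}.{host}")  # addressed to this server only
        broker.declare(self.fanout_queue, fanout_exchange=f"{topic}_fanout")


class ToyRpcClient:
    def __init__(self, broker, topic):
        self.broker, self.topic = broker, topic
        self.reply_queue = f"reply_{uuid.uuid4().hex}"  # transient
        broker.declare(self.reply_queue)

    def cast(self, method, fanout=False):
        message = {"method": method}
        if fanout:
            self.broker.broadcast(f"{self.topic}_fanout", message)
        else:
            self.broker.send(self.topic, message)

    def call(self, method):
        # The request names the queue where the server should post the reply.
        self.broker.send(
            self.topic, {"method": method, "reply_to": self.reply_queue}
        )


broker = ToyBroker()
servers = [ToyRpcServer(broker, "q-plugin", h) for h in ("host-a", "host-b")]
client = ToyRpcClient(broker, "q-plugin")

client.cast("port_update")               # lands once in the topic queue
client.cast("flush_cache", fanout=True)  # one copy per server fanout queue
client.call("get_devices")               # topic queue, carries reply_to

for name, messages in broker.queues.items():
    print(name, len(messages))
```

Even in this toy, two servers and one client already account for six queues, which hints at how quickly queue counts grow with the number of agents.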
-
Journey to get stable
- Infra
  - split rabbit-neutron / rabbit-*
  - scale problematic clusters to 5 nodes
  - upgrade to 3.10+
    - quorum queues recommended
- oslo.messaging improvements
  - fixed queue naming to avoid
  - move from HA queues to quorum queues
  - replace 'fanout' queues with stream queues => fewer queues
  - reduce the number of queues declared by each RPC server
  - use the same connection for multiple topics
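One plausible reading of the fixed queue naming item is that a stable name lets a restarted process reuse its queue instead of declaring a new one each time. The sketch below illustrates only that idea; both naming functions and the `hostname:processname:index` format are hypothetical and not the actual oslo.messaging scheme.

```python
import uuid


def random_queue_name(topic):
    # A fresh random suffix on every start: the old queue is orphaned.
    return f"{topic}_fanout_{uuid.uuid4().hex}"


def fixed_queue_name(topic, hostname, processname, index):
    # Hypothetical stable name derived from who the consumer is.
    return f"{topic}_fanout_{hostname}:{processname}:{index}"


def queues_after_restarts(namer, restarts):
    """Distinct queue names the broker has seen after N process restarts."""
    seen = set()
    for _ in range(restarts + 1):  # initial start plus each restart
        seen.add(namer())
    return seen


random_names = queues_after_restarts(
    lambda: random_queue_name("q-agent-notifier"), restarts=3
)
fixed_names = queues_after_restarts(
    lambda: fixed_queue_name("q-agent-notifier", "compute-1", "l2-agent", 0),
    restarts=3,
)
print(len(random_names), len(fixed_names))
```

With random names every restart adds a queue that must later expire or be cleaned up; with a stable name the count stays at one.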