- Issues with rabbit?
  - flap when rolling out agent / deploying new agent version
    - even crash on big regions
  - network flap / rabbit partition
    - pause-minority helped crash the cluster
    - reset cluster was ... the solution
- What's going on with rabbit?
  - reproduce workload with rabbit perftest
  - oslo.metrics
  - rabbitmq exporter / grafana dashboards
  - smokeping between nodes
- What we learned?
  - rabbitmq does not like large queue/connection churn at all
  - identified issues were mostly related to neutron
    - rabbit ddos
      - too many queue declares
      - too much tcp connection churn
  - Nova rpc usage is clearly != neutron
- Under the hood? RPC implementation in OpenStack, aka oslo.messaging
  - pub/sub
  - RPC server: set up endpoints / queues / listeners
  - publish: rpc provided methods
    - call - reply (topic / transient for reply)
    - cast (topic queue)
    - cast / fanout=true (fanout queue)
  - notifications: kafka
- Journey to get stable
  - Infra
    - split rabbit-neutron / rabbit-*
    - scale problematic clusters to 5 nodes
    - upgrade to 3.10+
      - quorum queues recommended
  - oslo.messaging improvements
    - queue fixed naming to avoid
    - move from HA queues > quorum queues
    - replace 'fanout' queues by stream queues => reduce queue nb
    - reduce queues declared by RPC server
      - use same connection for multiple topics
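
To get a feel for the "too many queue declares" and "reduce queue nb" points, here is a rough Python model of how many queues a fleet of RPC servers ends up declaring. It is only a sketch under assumptions: the per-server layout (`topic`, `topic.server`, `topic_fanout_<uuid>`, plus one `reply_<uuid>` queue per process that issues calls) follows the usual oslo.messaging rabbit driver pattern, while the shared stream name, the topic names and the agent count are hypothetical and chosen for illustration.

```python
import uuid


def rpc_server_queues(topic, server, stream_fanout=False):
    """Queue names one RPC server would declare for a single topic.

    Assumption: with stream fanout, every server reads the same stream,
    so the fanout name no longer carries a per-server uuid. The shared
    name used here is hypothetical.
    """
    if stream_fanout:
        fanout = f"{topic}_fanout"
    else:
        fanout = f"{topic}_fanout_{uuid.uuid4().hex}"
    return [topic, f"{topic}.{server}", fanout]


def cluster_queues(topics, servers, stream_fanout=False):
    """Set of all queue names declared by `servers` listening on `topics`."""
    queues = set()
    for server in servers:
        for topic in topics:
            queues.update(rpc_server_queues(topic, server, stream_fanout))
        # Assumption: each process also issues calls, hence one transient
        # reply queue per process.
        queues.add(f"reply_{uuid.uuid4().hex}")
    return queues


# Hypothetical fleet: 1000 agents, each listening on two topics.
agents = [f"agent-{i}" for i in range(1000)]
topics = ["topic-a", "topic-b"]

classic = cluster_queues(topics, agents)
streamed = cluster_queues(topics, agents, stream_fanout=True)
print("classic fanout queues:", len(classic))
print("stream fanout queues: ", len(streamed))
```

In this model the per-server fanout queues are the part that disappears when fanout moves to a shared stream. The random uuids in the classic fanout and reply queue names are also what produces a fresh wave of queue declares every time agents restart during a rollout.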