openinfraday/Follow the RabbitMQ - Plan.md

> Overview of the infra size we operate
- Intro
> What kind of issues we faced with rabbit
> Is it a RabbitMQ setup issue or an OpenStack issue?
* Issues with rabbit?
* flapping when rolling out agents / deploying a new agent version
* even crash on big regions
* network flap / rabbit partition
* pause-minority helped crash the cluster
* reset cluster was ... the solution
> Which methods did we use to troubleshoot those issues
> Observability, tools
* What's going on with rabbit?
* What we deployed to help troubleshooting issues
* reproduce workload with rabbit perftest
* oslo.metrics
* rabbitmq exporter / grafana dashboards
* smokeping between nodes
* rabbitspy
* What did we learn?
* RabbitMQ does not cope well with heavy queue/connection churn
* identified issues were mostly related to neutron
* rabbit ddos
* too many queue declarations
* too much TCP connection churn
* fanout mechanism: 1 message published, duplicated to N queues
* Nova RPC usage is clearly different from neutron's
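
To make the fanout point concrete, here is a minimal in-memory sketch of a fanout exchange. It is a toy model, not the RabbitMQ or oslo.messaging API; the `FanoutExchange` class, the `copies_for` helper and the `agent-N` queue names are all hypothetical.

```python
from collections import defaultdict, deque


class FanoutExchange:
    """Toy in-memory model of a fanout exchange (not a real broker API)."""

    def __init__(self):
        # queue name -> messages waiting to be consumed
        self.queues = defaultdict(deque)

    def bind(self, queue_name):
        # Touching the key declares the queue if it does not exist yet.
        self.queues[queue_name]

    def publish(self, message):
        # Fanout ignores routing keys: every bound queue receives a copy.
        for pending in self.queues.values():
            pending.append(message)
        return len(self.queues)


def copies_for(agent_count, message="port_update"):
    """Number of copies the broker stores for one published message."""
    exchange = FanoutExchange()
    for i in range(agent_count):
        exchange.bind(f"agent-{i}")  # hypothetical: one queue per agent
    return exchange.publish(message)
```

With one queue per agent, `copies_for(1000)` stores 1000 copies of a single published message, which is why fanout-heavy services weigh so much on the broker.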
> Before going further, let's take some time to understand how oslo.messaging works
> How RPC is implemented in OpenStack
> [[ oslo.messaging - How it works with rabbit]]
* Under the hood?
* pub/sub mechanism
* subscriber: RPC server topic=name
* setup class endpoints
* create queues / setup consumer thread
* publish with rpc provided methods
* call - reply (topic / transient for reply)
* cast (topic queue)
* cast / fanout=true (fanout queue)
* notifications for external use: kafka
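
The mapping between RPC methods and queues can be sketched as below. This is a simplified model, not oslo.messaging code: the naming patterns (`topic`, `topic.server`, `topic_fanout_<uuid>`, `reply_<uuid>`) are assumptions based on the default behaviour of the rabbit driver, and the function names are hypothetical.

```python
import uuid


def server_queues(topic, server):
    """Queues one RPC server consumes from, in this simplified model."""
    return [
        topic,                                 # shared by every server on the topic
        f"{topic}.{server}",                   # addressed to this server only
        f"{topic}_fanout_{uuid.uuid4().hex}",  # this server's copy of fanout casts
    ]


def reply_queue():
    """Transient queue a client listens on for the result of a call."""
    return f"reply_{uuid.uuid4().hex}"


def destination(topic, server=None, fanout=False):
    """Queue or exchange a call/cast is published to."""
    if fanout:
        # Exchange name: the broker copies the message to every fanout queue.
        return f"{topic}_fanout"
    if server is not None:
        return f"{topic}.{server}"
    return topic
```

In this model a `call` publishes to `destination(topic)` and waits on `reply_queue()`, a plain `cast` publishes to the same topic queue without waiting, and a `cast` with `fanout=True` goes to the fanout exchange.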
> What we did to put rabbits back to their holes
* Journey to get a stable infra.
* Infra improvements
* split rabbit-neutron / rabbit-\*
* scale problematic clusters to 5 nodes
* upgrade to RabbitMQ 3.10+
* quorum queue recommended
* put back partition strategy to pause-minority
* oslo.messaging improvements
* queue fixed naming to avoid queue churn
* heartbeat in pthread fix
* move from HA queues to quorum queues
* fix to autodelete broken quorum queues
* replace 'fanout' queues by stream queues
* greatly reduce the number of queues
* patch to avoid tcp reconnection when a queue is deleted (kombu/oslo)
* reduce queues declared by a RPC server (3 queues by default to only 1)
* use the same connection for multiple topics
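
Why fixed queue naming reduces churn can be illustrated with a short sketch. It assumes a service that picks a random suffix at every start, compared with one that derives the suffix from stable identifiers. The `hostname:process_name` scheme and all function names below are hypothetical, chosen for illustration rather than taken from oslo.messaging.

```python
import uuid


def random_fanout_queue(topic):
    # A new random suffix at every start: the old queue is orphaned
    # and a brand new one has to be declared.
    return f"{topic}_fanout_{uuid.uuid4().hex}"


def fixed_fanout_queue(topic, hostname, process_name):
    # Hypothetical stable scheme: the same process on the same host
    # always asks for the same queue.
    return f"{topic}_fanout_{hostname}:{process_name}"


def queues_declared(name_factory, restarts):
    """Distinct queues the broker sees after `restarts` service starts."""
    return len({name_factory() for _ in range(restarts)})


random_total = queues_declared(lambda: random_fanout_queue("q-agent"), 5)
fixed_total = queues_declared(
    lambda: fixed_fanout_queue("q-agent", "compute-1", "agent"), 5
)
```

Here `random_total` is 5 and `fixed_total` is 1: with stable names a restart reuses the existing queue instead of adding a declare (and a later delete) to the broker's workload.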