> Overview of the size of the infrastructure we operate
- Intro

> What kind of issues we faced with rabbit
> Is it a RabbitMQ setup issue or an OpenStack issue?

* Issues with rabbit?
  * flapping when rolling out agents / deploying a new agent version
  * even crashes on big regions
  * network flaps / rabbit partitions
  * `pause_minority` helped crash the cluster
  * resetting the cluster was ... the solution

> Which methods did we use to troubleshoot those issues
> Observability, tools

* What's going on with rabbit?
  * reproduce the workload with RabbitMQ PerfTest
  * oslo.metrics
  * rabbitmq exporter / Grafana dashboards
  * smokeping between nodes
  * rabbitspy
* What did we learn?
  * RabbitMQ does not cope well with heavy queue/connection churn
  * identified issues were mostly related to Neutron
    * rabbit DDoS
    * too many queue declarations
    * too much TCP connection churn
    * fanout mechanism: 1 message published, duplicated to N queues
  * Nova's RPC usage clearly differs from Neutron's
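
To put a number on the fanout point above, here is a minimal arithmetic sketch. The agent count and burst size are hypothetical, not figures from our regions; the only thing taken from the notes is that one published fanout message ends up as one copy per bound queue.

```python
def fanout_deliveries(published: int, bound_queues: int) -> int:
    """Copies the broker has to enqueue when every published message
    is duplicated into each queue bound to the fanout exchange."""
    return published * bound_queues


# Hypothetical region: 2000 agents, each with its own fanout queue.
agents = 2000

one_cast = fanout_deliveries(1, agents)   # a single fanout cast
burst = fanout_deliveries(50, agents)     # a burst of 50 fanout casts

print(one_cast, burst)
```

The cost of a fanout cast grows with the number of consumers, not with the number of publishers, which is why a region with many agents is hit harder than a small one.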

> Before going further, let's take some time to understand how oslo.messaging works
> How RPC is implemented in OpenStack
> [[ oslo.messaging - How it works with rabbit]]

* Under the hood?
  * pub/sub mechanism
  * subscriber: RPC server, topic=name
    * set up class endpoints
    * create queues / set up consumer thread
  * publish with the methods provided by RPC
    * call - reply (topic queue / transient queue for the reply)
    * cast (topic queue)
    * cast / fanout=true (fanout queue)
  * notifications for external use: Kafka
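
The routing described above can be sketched with a toy in-memory broker. This is not oslo.messaging code: `ToyBroker`, `ToyRPCServer` and `ToyRPCClient` are hypothetical names, and the queue naming (`topic`, `topic.server`, `topic_fanout_<uuid>`, `reply_<uuid>`) is an assumption about how the rabbit driver names things. It only shows which queue each kind of message lands in.

```python
import uuid
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for RabbitMQ: queue name -> list of messages."""

    def __init__(self):
        self.queues = defaultdict(list)
        self.fanout_bindings = defaultdict(set)  # topic -> fanout queue names

    def declare(self, name):
        self.queues[name]  # touching the key creates the (empty) queue
        return name


class ToyRPCServer:
    """Subscriber side: declares three queues for one topic."""

    def __init__(self, broker, topic, server):
        # Shared by every server listening on the topic.
        self.topic_queue = broker.declare(topic)
        # Addressed to this server only.
        self.server_queue = broker.declare(f"{topic}.{server}")
        # Private queue bound to the topic's fanout exchange.
        self.fanout_queue = broker.declare(f"{topic}_fanout_{uuid.uuid4().hex}")
        broker.fanout_bindings[topic].add(self.fanout_queue)


class ToyRPCClient:
    """Publisher side: call, cast, and cast with fanout=True."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        # Transient queue where replies to call() would be delivered.
        self.reply_queue = broker.declare(f"reply_{uuid.uuid4().hex}")

    def cast(self, method, fanout=False):
        msg = {"method": method}
        if fanout:
            # One publish, one copy per bound fanout queue.
            for name in self.broker.fanout_bindings[self.topic]:
                self.broker.queues[name].append(msg)
        else:
            self.broker.queues[self.topic].append(msg)

    def call(self, method):
        # Same path as cast, plus the queue the reply should be sent to.
        msg = {"method": method, "reply_q": self.reply_queue}
        self.broker.queues[self.topic].append(msg)


broker = ToyBroker()
servers = [ToyRPCServer(broker, "q-agent", f"host{i}") for i in range(3)]
client = ToyRPCClient(broker, "q-agent")

client.cast("sync")                  # lands in the topic queue
client.call("get_state")             # topic queue, reply expected on reply_queue
client.cast("refresh", fanout=True)  # one copy in each fanout queue

print(len(broker.queues), "queues declared")
```

With three servers and one client the toy broker already holds eight queues (one topic queue, three per-server queues, three fanout queues, one reply queue), which gives a feel for how queue counts grow with the number of agents.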

> What we did to put rabbits back to their holes

* Journey to get a stable infra.
  * Infra
    * split rabbit-neutron / rabbit-\*
    * scale problematic clusters to 5 nodes
    * upgrade RabbitMQ to 3.10+
    * quorum queues recommended
    * put the partition strategy back to `pause_minority`
  * oslo.messaging improvements
    * fixed queue naming to avoid queue churn
    * heartbeat in pthread fix
    * move from HA (mirrored) queues to quorum queues
    * fix to auto-delete broken quorum queues
    * replace 'fanout' queues with stream queues
    * greatly reduce the number of queues
    * patch to avoid TCP reconnection when a queue is deleted (kombu/oslo)
    * reduce the queues declared by an RPC server (from 3 by default to only 1)
    * use the same connection for multiple topics
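
Why fixed queue naming reduces churn can be shown with a small counting sketch. Both naming functions are hypothetical illustrations, not the scheme implemented in oslo.messaging: the point is only that a random suffix makes every restart declare a brand-new queue, while a name derived from stable identifiers lets a restarted process reuse the queue it had before.

```python
import uuid


def random_name(topic: str) -> str:
    """A fresh name on every start: the old queue is orphaned each time."""
    return f"{topic}_fanout_{uuid.uuid4().hex}"


def fixed_name(topic: str, host: str, process: str) -> str:
    """Hypothetical deterministic name built from stable identifiers."""
    return f"{topic}_fanout_{host}:{process}"


def distinct_queues(name_for_start, restarts: int) -> int:
    """Number of different queues the broker sees across `restarts` starts."""
    return len({name_for_start() for _ in range(restarts)})


restarts = 5
with_random = distinct_queues(lambda: random_name("q-agent"), restarts)
with_fixed = distinct_queues(lambda: fixed_name("q-agent", "host1", "agent"), restarts)

print(with_random, with_fixed)
```

Multiplied by the number of agents restarted during a rollout, the difference between the two counts is the queue churn the broker no longer has to absorb.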