diff --git a/Plan.md b/Plan.md
index f2c452a..9b8b837 100644
--- a/Plan.md
+++ b/Plan.md
@@ -1,46 +1,65 @@
-- Issues with rabbit ?
-  - flap when rolling out agent / deploying new agent version
-    - even crash on big regions
-  - network flap / rabbit partition
-    - pause-minority helped crash the cluster
-    - reset cluster was ... the solution
+
+> What kind of issues we faced with rabbit
+> Is it a RabbitMQ setup issue or an Openstack issue ?
+
+* Issues with rabbit ?
+  * flap when rolling out agent / deploying new agent version
+    * even crash on big regions
+  * network flap / rabbit partition
+    * pause-minority helped crash the cluster
+    * reset cluster was ... the solution
 
-- What's going on with rabbit ?
-  - reproduce workload with rabbit perftest
-  - oslo.metrics
-  - rabbitmq exporter / grafana dashboards
-  - smokeping between nodes
+> Which methods did we use to troubleshoot those issues
+> Observability, tools
 
-  - What we learned ?
-    - rabbitmq does not like at all large queue/connection churn
-    - identified issues were mostly related to neutron
-    - rabbit ddos
-      - too many queue declare
-      - too many tcp connection churn
-    - Nova rpc usage is clearly != neutron
-
-
-- Under the hood ? RPC implementation in Openstack: aka oslo.messaging
-  - pub/sub
-  - RPC server: setup endpoints / queues / listeners
-  - publish: rpc provided methods
-    - call - reply (topic / transient for reply)
-    - cast (topic queue)
-    - cast / fanout=true (fanout queue)
-  - notifications: kafka
+* What's going on with rabbit ?
+  * reproduce workload with rabbit perftest
+  * oslo.metrics
+  * rabbitmq exporter / grafana dashboards
+  * smokeping between nodes
+  * rabbitspy
+  * What we learned ?
+    * rabbitmq does not like at all large queue/connection churn
+    * identified issues were mostly related to neutron
+    * rabbit ddos
+      * too many queue declare
+      * too many tcp connection churn
+      * fanout mechanism: 1 message published, duplicated to N queues
+    * Nova rpc usage is clearly != neutron
 
-- Journey to get stable
-  - Infra
-    - split rabbit-neutron / rabbit-*
-    - scale problematic clusters to 5 node
-    - Upgrade to 3.10+
-      - quorum queue recommended
-  - oslo messaging improvment
-    - queue fixed naming to avoid
-    - move from HA queue > Quorum queues
-    - replace 'fanout' queues by stream queues => reduce queue nb
-    - reduce queue declared by RPC server
-    - use same connection for mutiple topics
-
+> Before going further, let's take some time to understand how oslo.messaging works
+> How RPC is implemented in Openstack
+
+* Under the hood ?
+  * pub/sub mechanism
+  * subscriber: RPC server topic=name
+    * setup class endpoints
+    * create queues / setup consumer thread
+  * publish with rpc provided methods
+    * call - reply (topic / transient for reply)
+    * cast (topic queue)
+    * cast / fanout=true (fanout queue)
+  * notifications for external use: kafka
+
+
+> What we did to put rabbits back to their holes
+
+* Journey to get a stable infra.
+  * Infra
+    * split rabbit-neutron / rabbit-\*
+    * scale problematic clusters to 5 nodes
+    * Upgrade to 3.10+
+      * quorum queue recommended
+    * put back partition strategy to pause-minority
+  * oslo messaging improvements
+    * queue fixed naming to avoid queue churn
+    * heartbeat in pthread fix
+    * move from HA queue > Quorum queues
+      * fix to autodelete broken quorum queues
+    * replace 'fanout' queues by stream queues
+      * reduce queue nb a lot
+    * patch to avoid tcp reconnection when a queue is deleted (kombu/oslo)
+    * reduce queues declared by an RPC server (3 queues by default to only 1)
+    * use same connection for multiple topics
\ No newline at end of file
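
The fanout item in the plan ("1 message published, duplicated to N queues") and the "replace 'fanout' queues by stream queues" item can be illustrated with a toy model. What follows is a minimal in-memory sketch, not RabbitMQ or oslo.messaging code: `ToyBroker`, `simulate_fanout`, `simulate_stream` and the `agent_fanout` names are hypothetical, and it assumes that each agent declares its own fanout queue, while a stream is a single shared log that every agent reads at its own offset.

```python
from collections import defaultdict, deque


class ToyBroker:
    """Toy in-memory model of fanout routing (hypothetical, not a real broker)."""

    def __init__(self):
        self.queues = {}                   # queue name -> deque of messages
        self.bindings = defaultdict(list)  # exchange name -> bound queue names

    def bind(self, exchange, queue):
        """Declare the queue if needed and bind it to the exchange."""
        self.queues.setdefault(queue, deque())
        if queue not in self.bindings[exchange]:
            self.bindings[exchange].append(queue)

    def publish_fanout(self, exchange, message):
        """Copy the message into every bound queue; return the number of copies."""
        targets = self.bindings[exchange]
        for queue in targets:
            self.queues[queue].append(message)
        return len(targets)


def simulate_fanout(agent_count):
    """One queue per agent: returns (queue count, stored copies of one message)."""
    broker = ToyBroker()
    for i in range(agent_count):
        broker.bind("agent_fanout", f"agent_fanout_{i}")
    copies = broker.publish_fanout("agent_fanout", {"method": "refresh"})
    return len(broker.queues), copies


def simulate_stream(agent_count):
    """One shared log: returns (queue count, stored copies of one message)."""
    stream = [{"method": "refresh"}]  # single append-only log for all agents
    offsets = {f"agent_{i}": 0 for i in range(agent_count)}
    for agent in offsets:
        # Reading does not remove the entry; each agent only advances its offset.
        offsets[agent] = len(stream)
    return 1, len(stream)


print(simulate_fanout(1000))  # (1000, 1000)
print(simulate_stream(1000))  # (1, 1)
```

In this model the number of queues and stored copies grows linearly with the number of agents for fanout, and stays constant for the stream. It only shows why queue count scales with agent count; it says nothing about actual broker load or latency.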