openinfraday/Follow the RabbitMQ - Plan.md

> Overview of the infra size we operate
- Intro

> What kind of issues we faced with rabbit
> Is it a RabbitMQ setup issue or an OpenStack issue?

* Issues with rabbit ?
    * flapping when rolling out agents / deploying a new agent version
        * even crashes on big regions
    * network flap / rabbit partition
        * pause-minority helped crash the cluster
    * reset cluster was ... the solution

> Which methods did we use to troubleshoot those issues
> Observability, tools

* What's going on with rabbit ?
    * What we deployed to help troubleshoot issues
        * reproduce workload with RabbitMQ PerfTest
        * oslo.metrics
        * rabbitmq exporter / grafana dashboards
        * smokeping between nodes
        * rabbitspy
    * What we learned ?
        * RabbitMQ copes badly with heavy queue/connection churn
        * identified issues were mostly related to neutron
            * rabbit ddos
                * too many queue declares
                * too much TCP connection churn
                * fanout mechanism: 1 message published, duplicated to N queues
        * Nova RPC usage is clearly != neutron
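
To make the fanout point concrete, here is a minimal in-memory sketch of what "1 message published, duplicated to N queues" means. It is a toy model, not a real broker: the class `ToyFanoutExchange`, the queue names, the agent count of 1000, and the `port_update` payload are all illustrative assumptions.

```python
class ToyFanoutExchange:
    """In-memory stand-in for a fanout exchange (hypothetical, not a real broker)."""

    def __init__(self):
        self.queues = {}  # queue name -> list of messages

    def bind(self, queue_name):
        # Binding creates the queue if it does not exist yet.
        self.queues.setdefault(queue_name, [])

    def publish(self, message):
        # A fanout exchange copies each message to every bound queue.
        for messages in self.queues.values():
            messages.append(message)
        return len(self.queues)  # number of copies written


exchange = ToyFanoutExchange()
for agent_id in range(1000):  # assumed number of agents, one queue each
    exchange.bind(f"agent_fanout_{agent_id}")

copies = exchange.publish({"method": "port_update"})
print(copies)  # one publish, one copy per bound queue
```

In this model the broker-side work for a single publish grows linearly with the number of bound queues, which is why a fanout-heavy service weighs much more on the cluster than its publish rate alone suggests.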

> Before going further, let's take some time to understand how oslo.messaging works
> How RPC is implemented in OpenStack
> [[ oslo.messaging - How it works with rabbit]]

* Under the hood ?
    * pub/sub mechanism
        * subscriber: RPC server topic=name
            * setup class endpoints
            * create queues / setup consumer thread
        * publish with rpc provided methods
            * call - reply (topic / transient for reply)
            * cast (topic queue)
            * cast / fanout=true (fanout queue)
    * notifications for external use: kafka
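
The call / cast / fanout routing above can be sketched with a toy in-memory model. This is not oslo.messaging code: every class here is hypothetical, and the queue naming (`topic`, `topic.server`, `topic_fanout_<uuid>`, `reply_<uuid>`) is an assumption meant to mirror the usual RabbitMQ driver layout.

```python
import uuid


class ToyBroker:
    """Dictionary of named queues; stands in for RabbitMQ."""

    def __init__(self):
        self.queues = {}

    def declare(self, name):
        self.queues.setdefault(name, [])
        return name

    def publish(self, name, message):
        self.queues[name].append(message)


class ToyRPCServer:
    """Declares three queues, as assumed for a default RPC server."""

    def __init__(self, broker, topic, server, endpoint):
        self.broker = broker
        self.endpoint = endpoint
        self.queue_names = [
            broker.declare(topic),                        # shared topic queue
            broker.declare(f"{topic}.{server}"),          # per-server queue
            broker.declare(f"{topic}_fanout_{uuid.uuid4().hex}"),
        ]

    def process(self):
        # Drain every queue and dispatch each message to the endpoint.
        for name in self.queue_names:
            queue = self.broker.queues[name]
            while queue:
                msg = queue.pop(0)
                result = getattr(self.endpoint, msg["method"])(**msg["args"])
                if "reply_to" in msg:  # only 'call' expects an answer
                    self.broker.publish(msg["reply_to"], result)


class ToyRPCClient:
    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.reply_queue = broker.declare(f"reply_{uuid.uuid4().hex}")

    def cast(self, method, fanout=False, **args):
        msg = {"method": method, "args": args}
        if fanout:
            # Copy the message to every server's fanout queue.
            for name in list(self.broker.queues):
                if name.startswith(f"{self.topic}_fanout_"):
                    self.broker.publish(name, msg)
        else:
            self.broker.publish(self.topic, msg)

    def call(self, method, **args):
        msg = {"method": method, "args": args, "reply_to": self.reply_queue}
        self.broker.publish(self.topic, msg)

    def read_reply(self):
        return self.broker.queues[self.reply_queue].pop(0)


class Endpoint:
    def __init__(self):
        self.seen = []

    def ping(self, value):
        self.seen.append(value)
        return f"pong {value}"


broker = ToyBroker()
ep_a, ep_b = Endpoint(), Endpoint()
server_a = ToyRPCServer(broker, "demo", "host-a", ep_a)
server_b = ToyRPCServer(broker, "demo", "host-b", ep_b)
client = ToyRPCClient(broker, "demo")

client.cast("ping", value=1)               # topic queue, one server handles it
client.cast("ping", fanout=True, value=2)  # every server handles it
client.call("ping", value=3)               # topic queue plus transient reply queue
server_a.process()
server_b.process()
reply = client.read_reply()
print(reply, len(broker.queues))
```

Even in this tiny setup, two servers and one client already account for six queues, which hints at how quickly the queue count grows with the number of agents.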

> What we did to put rabbits back to their holes

* Journey to get a stable infra.
    * Infra improvements
        * split rabbit-neutron / rabbit-\*
        * scale problematic clusters to 5 nodes
        * upgrade to RabbitMQ 3.10+
            * quorum queue recommended
        * set partition handling strategy back to pause-minority
    * oslo.messaging improvements
        * queue fixed naming to avoid queue churn
        * heartbeat in pthread fix
        * move from HA queues to quorum queues
            * fix to autodelete broken quorum queues
        * replace 'fanout' queues with stream queues
            * greatly reduces the number of queues
            * patch to avoid tcp reconnection when a queue is deleted (kombu/oslo)
        * reduce the queues declared by an RPC server (from 3 by default to only 1)
        * use the same connection for multiple topics
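
As a rough illustration of why fixed queue naming avoids queue churn, the sketch below compares a random suffix with a stable one across restarts. Both naming schemes, the host and process names, and the helper functions are hypothetical; this does not reproduce the exact format oslo.messaging uses.

```python
import uuid


def random_queue_name(prefix):
    # A fresh random suffix on every start: each restart declares a new queue.
    return f"{prefix}_{uuid.uuid4().hex}"


def fixed_queue_name(prefix, hostname, process_name, index):
    # Hypothetical stable scheme: the same worker gets the same name again.
    return f"{prefix}_{hostname}:{process_name}:{index}"


def queues_declared(name_factory, restarts):
    """Return the set of distinct queue names declared over several restarts."""
    declared = set()
    for _ in range(restarts):
        declared.add(name_factory())
    return declared


random_names = queues_declared(
    lambda: random_queue_name("reply"), restarts=5
)
fixed_names = queues_declared(
    lambda: fixed_queue_name("reply", "compute-1", "nova-compute", 1),
    restarts=5,
)
print(len(random_names), len(fixed_names))
```

With random names, every agent restart leaves an old queue to expire and a new one to declare; with a stable name, the restarted worker simply reattaches to the queue it already had.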