openinfraday/Follow the RabbitMQ - Plan.md

Overview of the infra size we operate

  • Intro

What kinds of issues did we face with rabbit? Is it a RabbitMQ setup issue or an OpenStack issue?

  • Issues with rabbit ?
    • flap when rolling out agent / deploying new agent version
      • even crash on big regions
    • network flap / rabbit partition
      • pause-minority helped crash the cluster
    • reset cluster was ... the solution

Which methods did we use to troubleshoot those issues? Observability and tools.

  • What's going on with rabbit ?
    • What we deployed to help troubleshoot issues
      • reproduce workload with rabbit perftest
      • oslo.metrics
      • rabbitmq exporter / grafana dashboards
      • smokeping between nodes
      • rabbitspy
    • What we learned ?
      • rabbitmq does not cope well with heavy queue/connection churn
      • identified issues were mostly related to neutron
        • rabbit ddos
          • too many queue declarations
          • too much tcp connection churn
          • fanout mechanism: 1 message published, duplicated to N queues
      • Nova rpc usage is clearly != neutron

Before going further, let's take some time to understand how oslo.messaging works: how RPC is implemented in OpenStack, and how oslo.messaging works with rabbit.

  • Under the hood ?
    • pub/sub mechanism
      • subscriber: RPC server topic=name
        • setup class endpoints
        • create queues / setup consumer thread
      • publish with rpc provided methods
        • call - reply (topic / transient for reply)
        • cast (topic queue)
        • cast / fanout=true (fanout queue)
    • notifications for external use: kafka
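
The call / cast / fanout patterns above can be sketched with a toy in-memory broker. This is only an illustration, not oslo.messaging code: `ToyBroker`, the topic `demo-topic`, the method `ping` and the queue name formats (`<topic>`, `<topic>.<server>`, `<topic>_fanout_<random>`, `reply_<random>`) are assumptions chosen to mirror the "3 queues per RPC server" layout mentioned later in this plan.

```python
import uuid


class ToyBroker:
    """In-memory stand-in for a broker; it only tracks queues and their contents."""

    def __init__(self):
        self.queues = {}           # queue name -> list of pending messages
        self.fanout_bindings = {}  # topic -> fanout queue names bound to it

    def declare(self, name):
        # Declaring is idempotent: an existing queue is returned unchanged.
        return self.queues.setdefault(name, [])

    def add_rpc_server(self, topic, server):
        # Assumed layout: shared topic queue, per-server queue, per-server fanout queue.
        fanout_queue = f"{topic}_fanout_{uuid.uuid4().hex}"
        for name in (topic, f"{topic}.{server}", fanout_queue):
            self.declare(name)
        self.fanout_bindings.setdefault(topic, []).append(fanout_queue)

    def cast(self, topic, method, fanout=False):
        # Plain cast: one copy in the shared topic queue.
        # Fanout cast: one copy per bound fanout queue.
        targets = self.fanout_bindings.get(topic, []) if fanout else [topic]
        for name in targets:
            self.declare(name).append(method)
        return len(targets)

    def call(self, topic, method):
        # A call is published like a cast, plus a transient reply queue for the caller.
        reply_queue = f"reply_{uuid.uuid4().hex}"
        self.declare(reply_queue)
        self.declare(topic).append((method, reply_queue))
        return reply_queue


broker = ToyBroker()
for host in ("host-1", "host-2", "host-3"):  # hypothetical server names
    broker.add_rpc_server("demo-topic", host)

queues_after_setup = len(broker.queues)  # 1 shared + 3 per-server + 3 fanout
copies_cast = broker.cast("demo-topic", "ping")
copies_fanout = broker.cast("demo-topic", "ping", fanout=True)
reply_queue = broker.call("demo-topic", "ping")
```

With three servers on one topic, a plain cast lands in a single queue while a fanout cast is copied once per server, which is why fanout traffic grows with the number of agents.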

What we did to put rabbits back to their holes

  • Journey to get a stable infra.
    • Infra improvements
      • split rabbit-neutron / rabbit-*
      • scale problematic clusters to 5 nodes
      • Upgrade to 3.10+
        • quorum queue recommended
      • put back partition strategy to pause-minority
    • oslo.messaging improvements
      • queue fixed naming to avoid queue churn
      • heartbeat in pthread fix
      • move from HA queues to quorum queues
        • fix to autodelete broken quorum queues
      • replace 'fanout' queues by stream queues
        • greatly reduces the number of queues
        • patch to avoid tcp reconnection when a queue is deleted (kombu/oslo)
      • reduce queues declared by an RPC server (3 queues by default to only 1)
      • use the same connection for multiple topics
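
To make the "fixed naming to avoid queue churn" point concrete, here is a minimal sketch comparing how many distinct queues get declared across agent restarts with random versus fixed names. The name formats, the topic `demo-topic`, the process name `agent` and the host names are hypothetical and chosen for illustration only; they are not the exact scheme used by oslo.messaging.

```python
import uuid


def queue_name(topic, host, process, fixed):
    """Return a fanout queue name (hypothetical formats, for illustration)."""
    if fixed:
        # Fixed naming: the same host/process always maps to the same queue.
        return f"{topic}_fanout_{host}:{process}"
    # Random naming: every (re)start produces a brand-new queue name.
    return f"{topic}_fanout_{uuid.uuid4().hex}"


def declared_after_restarts(hosts, restarts, fixed):
    """Count distinct queues declared when every host restarts `restarts` times."""
    declared = set()
    for _ in range(restarts):
        for host in hosts:
            declared.add(queue_name("demo-topic", host, "agent", fixed))
    return len(declared)


hosts = [f"host-{i}" for i in range(100)]  # hypothetical fleet of 100 agents
random_total = declared_after_restarts(hosts, restarts=5, fixed=False)
fixed_total = declared_after_restarts(hosts, restarts=5, fixed=True)
```

In this toy model, five rollouts over 100 agents declare 500 distinct queues with random names but only 100 with fixed names, since each restarted agent finds its previous queue again.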

...

  • Conclusion
    • when rabbitmq is used for what it is designed for, it works better
    • going further ?
      • let's write an oslo.messaging driver for another backend ?