openinfraday/Follow the RabbitMQ - Plan.md at 92d358854248405836e0f21afa990edf0dcd84a3 - openinfraday

Overview of the infra size we operate

Intro

What kind of issues did we face with RabbitMQ? Is it a RabbitMQ setup issue or an OpenStack issue?

Issues with rabbit ?
- flapping when rolling out agents / deploying a new agent version
  - even crashes on big regions
- network flaps / rabbit partitions
  - pause-minority helped crash the cluster
- resetting the cluster was ... the solution

Which methods did we use to troubleshoot those issues? Observability, tools

What's going on with rabbit ?
- What we deployed to help troubleshoot issues
  - reproduce the workload with RabbitMQ PerfTest
  - oslo.metrics
  - rabbitmq exporter / grafana dashboards
  - smokeping between nodes
  - rabbitspy
- What we learned ?
  - RabbitMQ does not cope well with heavy queue/connection churn
  - identified issues were mostly related to neutron
    - rabbit ddos
      - too many queue declarations
      - too much TCP connection churn
      - fanout mechanism: 1 message published, duplicated to N queues
  - Nova RPC usage is clearly different from neutron's
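
To get a feel for the orders of magnitude behind the fanout and queue-declare points above, here is a back-of-the-envelope sketch. The function names and the figures (1000 agents, 3 queues per agent) are hypothetical, chosen only for illustration; they are not measurements from our regions.

```python
def fanout_copies(published: int, bound_queues: int) -> int:
    """Messages the broker must enqueue for `published` fanout casts."""
    return published * bound_queues


def rollout_queue_declares(agents: int, queues_per_agent: int) -> int:
    """Queue declarations triggered when every agent restarts once."""
    return agents * queues_per_agent


# Hypothetical region: 1000 agents, each bound to one fanout queue
# and declaring 3 queues at startup.
agents = 1000
print(fanout_copies(published=1, bound_queues=agents))
print(rollout_queue_declares(agents, queues_per_agent=3))
```

Under these assumptions a single fanout cast becomes 1000 enqueued copies, and one rollout produces 3000 queue declarations in a short window.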

Before going further, let's take some time to understand how oslo.messaging works. How RPC is implemented in OpenStack: oslo.messaging and how it works with rabbit

Under the hood ?
- pub/sub mechanism
  - subscriber: RPC server topic=name
    - setup class endpoints
    - create queues / setup consumer thread
  - publish with rpc provided methods
    - call / reply (topic queue, transient queue for the reply)
    - cast (topic queue)
    - cast / fanout=true (fanout queue)
- notifications for external use: kafka
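
The pub/sub mechanism above can be sketched with a toy in-memory broker. This is not oslo.messaging code: `ToyBroker`, `ToyRPCServer` and the helper functions are hypothetical names, and the queue names (`topic`, `topic.server`, `topic_fanout_<uuid>`, `reply_<uuid>`) are an assumption modelled on what the rabbit driver typically creates.

```python
import uuid
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for RabbitMQ: named queues plus fanout bindings."""

    def __init__(self):
        self.queues = {}                          # queue name -> messages
        self.fanout_bindings = defaultdict(list)  # topic -> fanout queue names

    def declare(self, name):
        self.queues.setdefault(name, deque())
        return name

    def publish(self, queue, message):
        self.queues[queue].append(message)

    def publish_fanout(self, topic, message):
        # One publish is copied into every queue bound for this topic.
        for name in self.fanout_bindings[topic]:
            self.queues[name].append(message)


class ToyRPCServer:
    """Declares the three queues an RPC server listens on (assumed names)."""

    def __init__(self, broker, topic, server):
        self.topic_queue = broker.declare(topic)                 # shared by all servers
        self.server_queue = broker.declare(f"{topic}.{server}")  # this server only
        self.fanout_queue = broker.declare(f"{topic}_fanout_{uuid.uuid4().hex}")
        broker.fanout_bindings[topic].append(self.fanout_queue)


def cast(broker, topic, method):
    """Fire and forget: one message on the shared topic queue."""
    broker.publish(topic, {"method": method})


def cast_fanout(broker, topic, method):
    """Fire and forget to every server listening on the topic."""
    broker.publish_fanout(topic, {"method": method})


def call(broker, topic, method):
    """Request on the topic queue; the answer comes back on a transient queue."""
    reply_queue = broker.declare(f"reply_{uuid.uuid4().hex}")
    broker.publish(topic, {"method": method, "reply_to": reply_queue})
    return reply_queue


broker = ToyBroker()
servers = [ToyRPCServer(broker, "agent", f"host{i}") for i in range(3)]
cast(broker, "agent", "sync")
cast_fanout(broker, "agent", "refresh")
reply_queue = call(broker, "agent", "get_state")
print(len(broker.queues))  # 1 topic + 3 server + 3 fanout + 1 reply
```

Even in this toy, three servers and one caller already need eight queues, which is why the number of queues grows so quickly with the number of agents.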

What we did to put rabbits back to their holes

Journey to get a stable infra.
- Infra improvements
  - split rabbit-neutron / rabbit-*
  - scale problematic clusters to 5 nodes
  - upgrade to RabbitMQ 3.10+
    - quorum queues recommended
  - set the partition strategy back to pause-minority
- oslo.messaging improvements
  - fixed queue naming to avoid queue churn
  - heartbeat in pthread fix
  - move from HA queues to quorum queues
    - fix to auto-delete broken quorum queues
  - replace 'fanout' queues with stream queues
    - greatly reduces the number of queues
    - patch to avoid TCP reconnection when a queue is deleted (kombu/oslo)
  - reduce the queues declared by an RPC server (from 3 by default to only 1)
  - use the same connection for multiple topics
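
Why fixed queue naming avoids churn can be illustrated with a minimal sketch. The naming schemes below are hypothetical (a random suffix versus a suffix built from hostname and process name) and are not the exact format used by oslo.messaging.

```python
import uuid


def random_fanout_queue(topic: str) -> str:
    """A fresh name on every start: each restart creates a new queue."""
    return f"{topic}_fanout_{uuid.uuid4().hex}"


def fixed_fanout_queue(topic: str, hostname: str, process: str) -> str:
    """A name derived from a stable identity: restarts reuse the same queue."""
    return f"{topic}_fanout_{hostname}:{process}"


def queues_seen_after_restarts(name_fn, restarts: int) -> int:
    """Count the distinct queue names the broker sees across `restarts` starts."""
    return len({name_fn() for _ in range(restarts)})


random_count = queues_seen_after_restarts(
    lambda: random_fanout_queue("agent"), 5)
fixed_count = queues_seen_after_restarts(
    lambda: fixed_fanout_queue("agent", "host1", "neutron-agent"), 5)
print(random_count, fixed_count)
```

With random names, five restarts of one process leave five different queues for the broker to create and clean up; with a fixed name the same queue is simply declared again.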

...

Conclusion
- when RabbitMQ is used for what it was designed for, it works better
- going further ?
  - let's write an oslo.messaging driver for another backend ?
