Issues with rabbit?
- flapping when rolling out the agent / deploying a new agent version
  - even crashes on big regions
- network flaps / rabbit partitions
  - pause_minority helped crash the cluster
- resetting the cluster was ... the solution
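
To make the pause-minority point concrete, here is a minimal sketch of the rule as the RabbitMQ documentation describes it: a node pauses itself when the nodes it can still reach (itself included) amount to half or fewer of the cluster. The function names are illustrative, not RabbitMQ APIs, and the partition shapes are made-up examples rather than the incidents referred to above.

```python
from __future__ import annotations


def pauses_itself(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a node in pause_minority mode would pause.

    reachable_nodes counts the node itself plus the peers it can still see.
    A node keeps running only when it is part of a strict majority.
    """
    return 2 * reachable_nodes <= cluster_size


def surviving_nodes(partition_sizes: list[int]) -> int:
    """Count the nodes still running after the cluster splits into groups."""
    cluster_size = sum(partition_sizes)
    return sum(
        size
        for size in partition_sizes
        if not pauses_itself(size, cluster_size)
    )


if __name__ == "__main__":
    print(surviving_nodes([2, 1]))     # clean split of a 3-node cluster
    print(surviving_nodes([1, 1, 1]))  # every node isolated at once
```

In this model a clean 2/1 split keeps two nodes up, while a flap that briefly isolates every node leaves no majority anywhere, so all nodes pause. That is one plausible way pause_minority can turn a network flap into a full outage.
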
What's going on with rabbit?
- reproduce the workload with RabbitMQ PerfTest
- oslo.metrics
- rabbitmq exporter / grafana dashboards
- smokeping between nodes
- What did we learn?
  - rabbitmq does not cope well with heavy queue/connection churn
  - identified issues were mostly related to neutron
    - rabbit ddos
      - too many queue declarations
      - too much TCP connection churn
  - Nova's RPC usage pattern clearly differs from neutron's
Under the hood? RPC implementation in OpenStack, aka oslo.messaging
- pub/sub
  - RPC server: sets up endpoints / queues / listeners
  - publish: methods provided by RPC
    - call / reply (topic queue, plus a transient queue for the reply)
    - cast (topic queue)
    - cast / fanout=true (fanout queue)
- notifications: kafka
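
The three publish patterns above can be sketched with a tiny in-memory model. This is not oslo.messaging code: `ToyBroker` and its methods are hypothetical names, and the queue names only loosely imitate what a RabbitMQ-backed driver might create. It is meant to show which queue each pattern touches, nothing more.

```python
import itertools
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for a message broker; illustrative only."""

    def __init__(self):
        self.queues = defaultdict(deque)  # queue name -> pending messages
        self.fanout = defaultdict(list)   # topic -> per-server fanout queues
        self._ids = itertools.count(1)

    def add_server(self, topic):
        """Register an RPC server: shared topic queue + its own fanout queue."""
        self.queues[topic]  # touching the key declares the shared topic queue
        fanout_queue = f"{topic}_fanout_{next(self._ids)}"
        self.queues[fanout_queue]  # declare the per-server fanout queue
        self.fanout[topic].append(fanout_queue)
        return fanout_queue

    def cast(self, topic, message, fanout=False):
        """Fire and forget: one shared queue, or every server's fanout queue."""
        if fanout:
            for queue in self.fanout[topic]:
                self.queues[queue].append(message)
        else:
            self.queues[topic].append(message)

    def call(self, topic, message):
        """Request/reply: the request names a transient queue for the answer."""
        reply_queue = f"reply_{next(self._ids)}"
        self.queues[reply_queue]  # declare the transient reply queue
        self.queues[topic].append({"body": message, "reply_to": reply_queue})
        return reply_queue
```

Note how every registered server adds one fanout queue and every caller adds a reply queue: in this toy model the queue count grows with the number of processes, not with the number of topics.
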
Journey to get stable
- Infra
  - split rabbit-neutron / rabbit-*
  - scale problematic clusters to 5 nodes
  - upgrade to RabbitMQ 3.10+
    - quorum queues recommended
- oslo.messaging improvements
  - queue fixed naming to avoid
  - move from HA queues to quorum queues
  - replace 'fanout' queues with stream queues => fewer queues
  - reduce the number of queues declared by the RPC server
  - use the same connection for multiple topics
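
As a rough back-of-the-envelope sketch of why moving fanout to streams reduces the queue count: the model below assumes that, in the classic layout, each RPC server owns one fanout queue per topic, and that with streams all servers of a topic read from a single shared stream. Topic, per-host, and reply queues are ignored, and the topic names and server counts are made up for illustration.

```python
from __future__ import annotations


def fanout_queue_count(servers_per_topic: dict[str, int], use_streams: bool) -> int:
    """Estimate how many fanout queues (or streams) the broker has to hold.

    servers_per_topic maps a topic name to the number of RPC servers
    listening on it. Both layouts are simplifying assumptions.
    """
    if use_streams:
        # Assumed: one shared stream per topic that has at least one server.
        return sum(1 for count in servers_per_topic.values() if count > 0)
    # Assumed: one dedicated fanout queue per server and per topic.
    return sum(servers_per_topic.values())


if __name__ == "__main__":
    # Hypothetical region: two topics, 1000 agents listening on each.
    region = {"topic-a": 1000, "topic-b": 1000}
    print(fanout_queue_count(region, use_streams=False))
    print(fanout_queue_count(region, use_streams=True))
```

Under these assumptions the count drops from one queue per agent to one stream per topic, which also means an agent restart no longer has to declare and delete a fanout queue of its own.
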