openinfraday/Plan.md

  • Issues with rabbit?

    • flapping when rolling out agents / deploying a new agent version
      • even crashes in big regions
    • network flap / rabbit partition
      • pause_minority helped crash the cluster
    • resetting the cluster was ... the solution
  • What's going on with rabbit?

    • reproduce the workload with RabbitMQ PerfTest

    • oslo.metrics

    • rabbitmq exporter / grafana dashboards

    • smokeping between nodes

    • What did we learn?

      • RabbitMQ does not cope well with heavy queue/connection churn
      • identified issues were mostly related to neutron
        • rabbit DDoS
          • too many queue declarations
          • too much TCP connection churn
      • Nova RPC usage is clearly different from neutron's
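
One way to put numbers on "churn" from exporter data is to turn two scrapes of monotonically increasing counters into per-second rates. The sketch below is only an illustration: the metric names (`connections_opened_total`, `queues_declared_total`) and the sample values are hypothetical placeholders, so substitute whatever counters your exporter actually exposes.

```python
def counter_rates(previous, current, interval_s):
    """Per-second increase of each counter between two scrapes.

    `previous` and `current` map a metric name to a counter value;
    `interval_s` is the time between the two scrapes, in seconds.
    """
    rates = {}
    for name, value in current.items():
        delta = value - previous.get(name, value)
        if delta < 0:
            # Counter went backwards: assume it was reset (e.g. node restart)
            # and count only what accumulated since the reset.
            delta = value
        rates[name] = delta / interval_s
    return rates


# Hypothetical metric names and values, for illustration only.
scrape_t0 = {"connections_opened_total": 12000, "queues_declared_total": 50000}
scrape_t1 = {"connections_opened_total": 13800, "queues_declared_total": 56000}

rates = counter_rates(scrape_t0, scrape_t1, interval_s=60)
for name, per_second in sorted(rates.items()):
    print(f"{name}: {per_second:.1f}/s")
```

Graphing these rates per cluster (for example neutron versus everything else) makes it easier to see which service is responsible for the queue declarations and connection churn.
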
  • Under the hood? RPC implementation in OpenStack, aka oslo.messaging

    • pub/sub
      • RPC server: setup endpoints / queues / listeners
      • publish: rpc provided methods
        • call - reply (topic / transient for reply)
        • cast (topic queue)
        • cast / fanout=true (fanout queue)
    • notifications: Kafka
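
The three publish patterns above can be sketched with a toy in-memory model. This is not oslo.messaging code: `ToyBroker`, `ToyRPCServer`, `ToyRPCClient` and `AgentEndpoint` are hypothetical names, and the queue layout (a shared topic queue, a per-server queue, a private fanout queue per server, a transient reply queue per client) is an assumption that only roughly mirrors what the RabbitMQ driver declares.

```python
import uuid
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for a broker: named queues plus fanout bindings."""

    def __init__(self):
        self.queues = {}                         # queue name -> pending messages
        self.fanout_bindings = defaultdict(set)  # topic -> fanout queue names

    def declare(self, name):
        self.queues.setdefault(name, deque())
        return name

    def publish(self, queue, message):
        self.queues[queue].append(message)

    def publish_fanout(self, topic, message):
        for queue in self.fanout_bindings[topic]:
            self.queues[queue].append(message)


class ToyRPCServer:
    """Declares its queues at startup and dispatches messages to an endpoint."""

    def __init__(self, broker, topic, server, endpoint):
        self.broker, self.endpoint = broker, endpoint
        fanout = broker.declare(f"{topic}_fanout_{uuid.uuid4().hex}")
        broker.fanout_bindings[topic].add(fanout)
        self.queues = [
            broker.declare(topic),                # shared by all servers
            broker.declare(f"{topic}.{server}"),  # targeted at this server
            fanout,                               # private copy of fanout casts
        ]

    def process(self):
        for name in self.queues:
            pending = self.broker.queues[name]
            while pending:
                msg = pending.popleft()
                result = getattr(self.endpoint, msg["method"])(**msg["args"])
                if msg.get("reply_to"):           # only calls expect a reply
                    self.broker.publish(msg["reply_to"], result)


class ToyRPCClient:
    def __init__(self, broker, topic):
        self.broker, self.topic = broker, topic
        self.reply_queue = broker.declare(f"reply_{uuid.uuid4().hex}")

    def cast(self, method, fanout=False, **args):
        msg = {"method": method, "args": args}
        if fanout:
            self.broker.publish_fanout(self.topic, msg)  # every server
        else:
            self.broker.publish(self.topic, msg)         # exactly one server

    def call(self, method, **args):
        msg = {"method": method, "args": args, "reply_to": self.reply_queue}
        self.broker.publish(self.topic, msg)

    def take_reply(self):
        return self.broker.queues[self.reply_queue].popleft()


class AgentEndpoint:
    def __init__(self):
        self.seen = []

    def ping(self, value):
        self.seen.append(value)
        return f"pong {value}"


broker = ToyBroker()
ep1, ep2 = AgentEndpoint(), AgentEndpoint()
s1 = ToyRPCServer(broker, "agent", "host1", ep1)
s2 = ToyRPCServer(broker, "agent", "host2", ep2)
client = ToyRPCClient(broker, "agent")

client.cast("ping", value=1)               # cast: one server handles it
client.cast("ping", fanout=True, value=2)  # fanout cast: every server handles it
client.call("ping", value=3)               # call: one server handles it and replies
s1.process()
s2.process()
reply = client.take_reply()
```

Even in this toy setup, two servers and one client already produce six queues, which hints at why the queue count grows quickly with the number of agents.
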
  • Journey to get stable

    • Infra
      • split rabbit-neutron / rabbit-*
      • scale problematic clusters to 5 nodes
      • upgrade to RabbitMQ 3.10+
        • quorum queues recommended
    • oslo.messaging improvements
      • queue fixed naming to avoid
      • move from HA queues to quorum queues
      • replace 'fanout' queues with stream queues => fewer queues
      • reduce the number of queues declared by the RPC server
      • use the same connection for multiple topics
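
To get a feel for why fixed naming and stream-based fanout matter, here is a back-of-the-envelope model of the queues a broker sees across one agent rollout. It is a sketch under stated assumptions, not a measurement: the naming scheme, the host and topic counts, and the idea that randomly named queues from the previous run linger after a restart are all hypothetical.

```python
import uuid


def random_name(prefix):
    return f"{prefix}_{uuid.uuid4().hex}"


def declared_queues(hosts, topics, stream_fanout, fixed_names):
    """Queue names declared when every host starts one RPC server per topic."""
    names = set()
    for topic in topics:
        names.add(topic)                  # shared topic queue
        if stream_fanout:
            names.add(f"{topic}_fanout")  # a single stream read by every server
    for host in hosts:
        # One reply queue per host (assumed: one client process per host).
        names.add(f"reply_{host}" if fixed_names else random_name("reply"))
        for topic in topics:
            names.add(f"{topic}.{host}")  # per-server queue, already stable
            if not stream_fanout:
                # Classic fanout: every server gets its own private queue.
                names.add(f"{topic}_fanout_{host}" if fixed_names
                          else random_name(f"{topic}_fanout"))
    return names


hosts = [f"host{i}" for i in range(1000)]  # hypothetical region size
topics = ["topic-a", "topic-b", "topic-c"]  # hypothetical topic names

# Before: private fanout queues, random names regenerated at each start.
before_start = declared_queues(hosts, topics, stream_fanout=False, fixed_names=False)
before_restart = declared_queues(hosts, topics, stream_fanout=False, fixed_names=False)

# After: stream fanout and fixed names, so a restart re-declares the same queues.
after_start = declared_queues(hosts, topics, stream_fanout=True, fixed_names=True)
after_restart = declared_queues(hosts, topics, stream_fanout=True, fixed_names=True)

print("before:", len(before_start), "->", len(before_start | before_restart))
print("after: ", len(after_start), "->", len(after_start | after_restart))
```

In this model the "before" case grows from 7003 to 11003 distinct queue names over a single rollout, while the "after" case stays at 4006, because restarts reuse existing names and fanout no longer needs one queue per server.
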