Autocommit action=MODIFY on file=Plan.md detected

Flatnotes 2024-05-05 20:29:10 +00:00
parent 0866e4a7e0
commit eeff712455
1 changed files with 60 additions and 41 deletions

Plan.md

@@ -1,46 +1,65 @@
-- Issues with rabbit ?
-  - flap when rolling out agent / deploying new agent version
-  - even crash on big regions
-  - network flap / rabbit partition
-  - pause-minority helped crash the cluster
-  - reset cluster was ... the solution
-- What's going on with rabbit ?
-  - reproduce workload with rabbit perftest
-  - oslo.metrics
-  - rabbitmq exporter / grafana dashboards
-  - smokeping between nodes
-- What we learned ?
-  - rabbitmq does not like at all large queue/connection churn
-  - identified issues were mostly related to neutron
-  - rabbit ddos
-    - too many queue declare
-    - too many tcp connection churn
-  - Nova rpc usage is clearly != neutron
-- Under the hood ? RPC implementation in OpenStack: aka oslo.messaging
-  - pub/sub
-  - RPC server: setup endpoints / queues / listeners
-  - publish: rpc provided methods
-    - call - reply (topic / transient for reply)
-    - cast (topic queue)
-    - cast / fanout=true (fanout queue)
-  - notifications: kafka
-- Journey to get stable
-  - Infra
-    - split rabbit-neutron / rabbit-*
-    - scale problematic clusters to 5 nodes
-    - Upgrade to 3.10+
-    - quorum queue recommended
-  - oslo messaging improvement
-    - queue fixed naming to avoid
-    - move from HA queue > Quorum queues
-    - replace 'fanout' queues by stream queues => reduce queue nb
-    - reduce queue declared by RPC server
-    - use same connection for multiple topics
+> What kind of issues we faced with rabbit
+> Is it a RabbitMQ setup issue or an OpenStack issue ?
+* Issues with rabbit ?
+  * flap when rolling out agent / deploying new agent version
+  * even crash on big regions
+  * network flap / rabbit partition
+  * pause-minority helped crash the cluster
+  * reset cluster was ... the solution
+> Which methods did we use to troubleshoot those issues
+> Observability, tools
+* What's going on with rabbit ?
+  * reproduce workload with rabbit perftest
+  * oslo.metrics
+  * rabbitmq exporter / grafana dashboards
+  * smokeping between nodes
+  * rabbitspy
+* What we learned ?
+  * rabbitmq does not like at all large queue/connection churn
+  * identified issues were mostly related to neutron
+  * rabbit ddos
+    * too many queue declare
+    * too many tcp connection churn
+    * fanout mechanism: 1 message published, duplicated to N queues
+  * Nova rpc usage is clearly != neutron
+> Before going further, let's take some time to understand how oslo.messaging works
+> How RPC is implemented in OpenStack
+* Under the hood ?
+  * pub/sub mechanism
+  * subscriber: RPC server topic=name
+    * setup class endpoints
+    * create queues / setup consumer thread
+  * publish with rpc provided methods
+    * call - reply (topic / transient for reply)
+    * cast (topic queue)
+    * cast / fanout=true (fanout queue)
+  * notifications for external use: kafka
+> What we did to put rabbits back to their holes
+* Journey to get a stable infra.
+  * Infra
+    * split rabbit-neutron / rabbit-\*
+    * scale problematic clusters to 5 nodes
+    * Upgrade to 3.10+
+    * quorum queue recommended
+    * put back partition strategy to pause-minority
+  * oslo messaging improvements
+    * queue fixed naming to avoid queue churn
+    * heartbeat in pthread fix
+    * move from HA queue > Quorum queues
+    * fix to autodelete broken quorum queues
+    * replace 'fanout' queues by stream queues
+      * reduce queue nb a lot
+    * patch to avoid tcp reconnection when a queue is deleted (kombu/oslo)
+    * reduce queues declared by a RPC server (3 queues by default to only 1)
+    * use same connection for multiple topics
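
The queue counts and the fanout duplication mentioned in the outline can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyBroker` is a hypothetical class written for this example, it is neither RabbitMQ nor the oslo.messaging driver, and the queue names (`topic`, `topic.server`, `topic_fanout_server`) are an assumed naming scheme used to show "3 queues per RPC server" and "1 message published, duplicated to N queues".

```python
from collections import defaultdict, deque


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker (illustration only)."""

    def __init__(self):
        self.queues = defaultdict(deque)         # queue name -> pending messages
        self.fanout_bindings = defaultdict(set)  # topic -> fanout queue names

    def declare_server(self, topic, server):
        """Declare three queues for one RPC server (assumed naming scheme)."""
        names = [topic, f"{topic}.{server}", f"{topic}_fanout_{server}"]
        for name in names:
            self.queues[name]  # touching the defaultdict creates the queue
        self.fanout_bindings[topic].add(names[2])
        return names

    def cast(self, topic, message, server=None, fanout=False):
        """Publish without waiting for a reply; return how many copies were stored."""
        if fanout:
            targets = sorted(self.fanout_bindings[topic])  # one copy per server
        elif server is not None:
            targets = [f"{topic}.{server}"]                # one specific server
        else:
            targets = [topic]                              # shared topic queue
        for name in targets:
            self.queues[name].append(message)
        return len(targets)


broker = ToyBroker()
for host in ("net-1", "net-2", "net-3"):
    broker.declare_server("agent", host)

print(len(broker.queues))                                     # 7
print(broker.cast("agent", {"method": "ping"}))               # 1
print(broker.cast("agent", {"method": "sync"}, fanout=True))  # 3
```

In this model three servers on one topic already need 7 queues (1 shared topic queue, 3 per-server queues, 3 fanout queues), and a single fanout cast is stored 3 times. The same arithmetic at the scale of a large region is one way to read the "too many queue declare" and "reduce queue nb a lot" items above.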