Autocommit action=MODIFY on file=Plan.md detected
parent 0866e4a7e0
commit eeff712455
95
Plan.md
@@ -1,46 +1,65 @@
-- Issues with rabbit ?
-- flap when rolling out agent / deploying new agent version
-- even crash on big regions
-- network flap / rabbit partition
-- pause-minority helped crash the cluster
-- reset cluster was ... the solution
-
+> What kind of issues we faced with rabbit
+> Is it a RabbitMQ setup issue or an Openstack issue ?
+
+* Issues with rabbit ?
+* flap when rolling out agent / deploying new agent version
+* even crash on big regions
+* network flap / rabbit partition
+* pause-minority helped crash the cluster
+* reset cluster was ... the solution
+
-- What's going on with rabbit ?
-- reproduce workload with rabbit perftest
-- oslo.metrics
-- rabbitmq exporter / grafana dashboards
-- smokeping between nodes
-
-- What we learned ?
-- rabbitmq does not like at all large queue/connection churn
-- identified issues were mostly related to neutron
-- rabbit ddos
-- too many queue declare
-- too many tcp connection churn
-- Nova rpc usage is clearly != neutron
-
+> Which methods did we use to troubleshoot those issues
+> Observability, tools
+
+* What's going on with rabbit ?
+* reproduce workload with rabbit perftest
+* oslo.metrics
+* rabbitmq exporter / grafana dashboards
+* smokeping between nodes
+* rabbitspy
+* What we learned ?
+* rabbitmq does not like at all large queue/connection churn
+* identified issues were mostly related to neutron
+* rabbit ddos
+* too many queue declare
+* too many tcp connection churn
+* fanout mechanism 1 message published, duplicated to N queues
+* Nova rpc usage is clearly != neutron
+
-- Under the hood ? RPC implementation in Openstack: aka oslo.messaging
-- pub/sub
-- RPC server: setup endpoints / queues / listeners
-- publish: rpc provided methods
-- call - reply (topic / transient for reply)
-- cast (topic queue)
-- cast / fanout=true (fanout queue)
-- notifications: kafka
-
+> Before going further, let's take some time to understand how oslo.messaging work
+> How RPC is implemented in Openstack
+
+* Under the hood ?
+* pub/sub mechanism
+* subscriber: RPC server topic=name
+* setup class endpoints
+* create queues / setup consumer thread
+* publish with rpc provided methods
+* call - reply (topic / transient for reply)
+* cast (topic queue)
+* cast / fanout=true (fanout queue)
+* notifications for external use: kafka
+
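The call, cast and fanout patterns listed in the outline above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `ToyBroker`, the `agent` topic and the `host0`..`host4` server names are hypothetical, the queue-name patterns (`<topic>_fanout_<uuid>`, `reply_<uuid>`) are assumed for illustration, and nothing here uses the real oslo.messaging or RabbitMQ APIs. It shows why one fanout cast is stored once per listening server, while a plain cast is stored once in the shared topic queue.

```python
from collections import defaultdict, deque
import uuid


class ToyBroker:
    """In-memory stand-in for a broker; not RabbitMQ or oslo.messaging."""

    def __init__(self):
        self.queues = defaultdict(deque)          # queue name -> pending messages
        self.fanout_bindings = defaultdict(list)  # topic -> fanout queue names

    def declare_server_queues(self, topic, server):
        """Declare the three queues one RPC server listens on in this model."""
        fanout_queue = f"{topic}_fanout_{uuid.uuid4().hex}"  # assumed naming
        for name in (topic, f"{topic}.{server}", fanout_queue):
            self.queues[name]  # touching the key creates the empty queue
        self.fanout_bindings[topic].append(fanout_queue)
        return fanout_queue

    def cast(self, topic, message, fanout=False):
        """Publish without waiting for a reply; return the copies stored."""
        targets = self.fanout_bindings[topic] if fanout else [topic]
        for name in targets:
            self.queues[name].append(message)
        return len(targets)

    def call(self, topic, message):
        """Publish to the topic queue and declare a transient reply queue."""
        reply_queue = f"reply_{uuid.uuid4().hex}"  # assumed naming
        self.queues[reply_queue]
        self.queues[topic].append({"msg": message, "reply_to": reply_queue})
        return reply_queue


broker = ToyBroker()
for i in range(5):
    broker.declare_server_queues("agent", f"host{i}")

copies_cast = broker.cast("agent", "sync")                 # one shared topic queue
copies_fanout = broker.cast("agent", "sync", fanout=True)  # one copy per server
reply_queue = broker.call("agent", "get_state")
print(copies_cast, copies_fanout, len(broker.queues))
```

In this model, five servers already account for eleven queues before any reply queue exists, which is the kind of queue count the outline's "reduce queues declared by a RPC server" item targets.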
-- Journey to get stable
-- Infra
-- split rabbit-neutron / rabbit-*
-- scale problematic clusters to 5 node
-- Upgrade to 3.10+
-- quorum queue recommended
-- oslo messaging improvment
-- queue fixed naming to avoid
-- move from HA queue > Quorum queues
-- replace 'fanout' queues by stream queues => reduce queue nb
-- reduce queue declared by RPC server
-- use same connection for mutiple topics
+> What we did to put rabbits back to their holes
+
+* Journey to get a stable infra.
+* Infra
+* split rabbit-neutron / rabbit-\*
+* scale problematic clusters to 5 node
+* Upgrade to 3.10+
+* quorum queue recommended
+* put back partition strategy to pause-minority
+* oslo messaging improvments
+* queue fixed naming to avoid queue churn
+* heartbeat in pthread fix
+* move from HA queue > Quorum queues
+* fix to autodelete broken quorum queues
+* replace 'fanout' queues by stream queues
+* reduce queue nb a lot
+* patch to avoid tcp reconnection when a queue is deleted (kombu/oslo)
+* reduce queues declared by a RPC server (3 queues by default to only 1)
+* use same connection for mutiple topics
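The "queue fixed naming to avoid queue churn" item can be illustrated with a rough count. This is a sketch, assuming that churn comes from queue names carrying a random suffix that changes at every agent restart; the naming scheme, the `agent` topic and the host names are hypothetical and are not taken from oslo.messaging.

```python
import uuid


def fanout_queue_name(topic, host, fixed):
    """Return the queue name an agent would declare (hypothetical scheme)."""
    suffix = host if fixed else uuid.uuid4().hex
    return f"{topic}_fanout_{suffix}"


def queues_declared(hosts, restarts, fixed):
    """Count distinct queue names declared over several agent restarts."""
    declared = set()
    for _ in range(restarts):
        for host in hosts:
            declared.add(fanout_queue_name("agent", host, fixed))
    return len(declared)


hosts = [f"host{i}" for i in range(100)]
random_names = queues_declared(hosts, restarts=3, fixed=False)
fixed_names = queues_declared(hosts, restarts=3, fixed=True)
print(random_names, fixed_names)
```

With random suffixes, every restart declares a brand-new queue per agent and the old ones have to be cleaned up. With fixed names, a restarted agent redeclares the queue it already had, so the broker sees no new queue.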