Resolve RabbitMQ cluster issues in vRA 8.x deployment
https://knowledge.broadcom.com/external/article?articleNumber=319575
Issue/Introduction
- The following symptoms are observed:
- Failed to publish event to topic: Deployment resource action requested
- Failed to publish event to topic: Deployment requested
- Deployment requests are stuck in different life-cycle states for a long time until a time-out is reached.
- All deployment requests start failing, and a restart of the node(s) is necessary to bring the environment back.
- An alert is raised every 10-14 days from vROps: Description: Aria Automation is Down. Object Name: ebs
- Below are log details from the EBS app-server logs:
... computing metrics in newChannel: null
2023-11-01T09:14:03.038Z DEBUG event-broker [host='ebs-app-5c66ffc6df-cmtjb' thread='main-pool-35' user='' org='' trace='1XXXXX8-9XXX6-4XXa-bXX5-aXXXXXXXXXXXc' request-trace=''] c.v.a.e.b.s.EventBrokerConfiguration.lambda$initialize$0:123 - Operator Error: (NullPointerException) The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
java.lang.NullPointerException: The mapper [reactor.rabbitmq.Receiver$ChannelCreationFunction] returned a null value.
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:115)
    at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2400)
    at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:171)
    at io.opentracing.contrib.reactor.TracedSubscrib...
Environment
- VMware Aria Automation 8.x
Cause
- Suspending a vRA node, or a network partition between the vRA nodes in a clustered deployment, results in connectivity issues between the RabbitMQ cluster members, which can leave RabbitMQ in a de-clustered state.
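A quick way to confirm this condition (a hedged check, assuming the prelude namespace and pod names used in the workaround below) is to inspect the partitions entry of the RabbitMQ cluster status, which is non-empty when the cluster has split:
kubectl -n prelude exec rabbitmq-ha-0 -- rabbitmqctl cluster_status | grep -i -A 2 partitions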
Resolution
- There is no resolution for the issue at the moment, as it depends on the RabbitMQ cluster resilience.
- Workaround:
To work around the issue, reset the RabbitMQ cluster:
- SSH login to one of the nodes in the vRA cluster.
- Check the rabbitmq-ha pods status: "kubectl -n prelude get pods --selector=app=rabbitmq-ha"
NAME READY STATUS RESTARTS AGE
rabbitmq-ha-0 1/1 Running 0 3d16h
rabbitmq-ha-1 1/1 Running 0 3d16h
rabbitmq-ha-2 1/1 Running 0 3d16h
- If all rabbitmq-ha pods are healthy, check the RabbitMQ cluster status for each of them (a command sketch for all three pods follows the sample output below):
NOTE: Analyze the command output for each RabbitMQ node and verify that the "running_nodes" list contains all cluster members from the "nodes > disc" list:
[{nodes,
[{disc,
['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local','rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
{running_nodes,
['rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local',
'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
'rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
...]
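To collect the status from all three pods in one pass, a minimal sketch, assuming the prelude namespace, the pod names shown above, and that rabbitmqctl is available inside the pods:
# Print the cluster status as reported by each rabbitmq-ha pod.
for pod in rabbitmq-ha-0 rabbitmq-ha-1 rabbitmq-ha-2; do
  echo "=== ${pod} ==="
  kubectl -n prelude exec "${pod}" -- rabbitmqctl cluster_status
done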
- If the "running_nodes" list doesn't contain all rabbitMQ cluster members, RabbitMQ is in de-clustered state and needs to be manually reconfigured. For example:
[{disc,
['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local',
'rabbit@rabbitmq-ha-1.rabbitmq-ha-discovery.prelude.svc.cluster.local',
'rabbit@rabbitmq-ha-2.rabbitmq-ha-discovery.prelude.svc.cluster.local']}]},
{running_nodes,
['rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.prelude.svc.cluster.local']}
...]
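To isolate the relevant section when scripting this check, the running_nodes block can be filtered from the same output (an illustrative sketch; the exact line breaks depend on the rabbitmqctl version):
kubectl -n prelude exec rabbitmq-ha-0 -- rabbitmqctl cluster_status | grep -A 3 running_nodes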
To reconfigure the RabbitMQ cluster, complete the steps below:
- SSH login to one of the vRA nodes
- Reconfigure the RabbitMQ cluster: "vracli reset rabbitmq"
'reset rabbitmq' is a destructive command. Type 'yes' if you want to continue, or 'no' to stop: yes
- Wait until all rabbitmq-ha pods are re-created and healthy: "kubectl -n prelude get pods --selector=app=rabbitmq-ha"
rabbitmq-ha-0 1/1 Running 0 9m53s
rabbitmq-ha-1 1/1 Running 0 9m35s
rabbitmq-ha-2 1/1 Running 0 9m14s
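Instead of polling manually, kubectl can block until the pods report Ready; a sketch (the 10-minute timeout is an assumption, not a documented value):
kubectl -n prelude wait pod --selector=app=rabbitmq-ha --for=condition=Ready --timeout=600s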
- Delete the ebs pods: "kubectl -n prelude delete pods --selector=app=ebs-app".
- Wait until all ebs pods are re-created and ready: "kubectl -n prelude get pods --selector=app=ebs-app".
ebs-app-84dd59f4f4-jvbsf 1/1 Running 0 2m55s
ebs-app-84dd59f4f4-khv75 1/1 Running 0 2m55s
ebs-app-84dd59f4f4-xthfs 1/1 Running 0 2m55s
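The same readiness wait applies to the ebs pods (again, the timeout is an assumed value):
kubectl -n prelude wait pod --selector=app=ebs-app --for=condition=Ready --timeout=600s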
- The RabbitMQ cluster is reconfigured. Request a new Deployment to verify that it completes successfully.
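As an additional check, the recent EBS logs can be scanned for the mapper error from the symptoms above (a hedged sketch; with a selector, kubectl logs returns the tail of each matching pod's log):
kubectl -n prelude logs --selector=app=ebs-app --tail=100 | grep -i "returned a null value" || echo "No mapper errors found"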