
Aria Automation node from a 3-node cluster is down/unavailable, and Provisioning is not functioning


https://knowledge.broadcom.com/external/article?articleNumber=377795

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

  • One Aria Automation node is down or unavailable due to infrastructure issues
  • The Aria Automation portal is accessible
  • VM provisioning takes a long time and eventually fails with errors about event topics, e.g.: "Failed to publish event to topic: Deployment requested"
  • Reviewing Aria Automation services with the command "kubectl -n prelude get pods -o wide" shows that only the pods on one node are down
  • Reviewing RabbitMQ status with the command below shows only 1 node as an active node (Ref: Resolve RabbitMQ cluster issues in vRA 8.x deployment)
    seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"
  • API calls to Aria Automation may fail with HTTP status 500 - Internal Server Error
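The pod check in the symptoms above can be narrowed to the pods that are not healthy. The following is a minimal sketch, not part of the original article: the helper name `not_running_pods` is hypothetical, and it assumes the default `kubectl get pods` column layout, in which NAME is the first column and STATUS is the third.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the KB article): reads `kubectl get pods`
# output on stdin and prints the name and status of every pod whose STATUS
# is neither Running nor Completed.
# Assumption: default column layout (NAME in column 1, STATUS in column 3).
not_running_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage on an appliance node with kubectl access:
#   kubectl -n prelude get pods -o wide | not_running_pods
```

An empty result would suggest that every pod in the listing is Running or Completed. Any pod that is listed should be checked against the full "-o wide" output to see which node it is scheduled on.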

Environment

Aria Automation 8.x

Cause

Cluster instability may be caused when one of the Aria Automation nodes goes down, which can lead to issues in the messaging queue, RabbitMQ.

Due to RabbitMQ isolation, another RabbitMQ service was stopped; therefore, the last working RabbitMQ service stopped handling any messages.

Resolution

Restore the Aria Automation node. If Linux booted into the emergency console, review this article:

"Failed to start file system check on /dev/disk..." error on Photon OS based virtual appliances

 

To work around the issue while the node is not working:

Before proceeding, take a snapshot, including memory, of the 2 available nodes from vCenter.

  1. Identify the currently running RabbitMQ nodes:
    kubectl -n prelude get pods -o wide | grep -Ei "name|rabbitmq"
  2. Identify which pods are currently running the RabbitMQ application:
    seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"

    E.g.: only "rabbitmq-ha-0" is active. Depending on which node is down, the RabbitMQ application has to be started on the other one ("rabbitmq-ha-1" or "rabbitmq-ha-2").
  3. Try to start the RabbitMQ application on the node that is available but not listed under "Running Nodes" in the output of the command above.

    E.g.: Node 3 of the Aria Automation cluster has the outage; Node 2 is running, but its RabbitMQ is not reporting as running.
    kubectl exec -n prelude rabbitmq-ha-1 -- bash -c "rabbitmqctl start_app"
  4. Validate that 2 nodes now report as running, using the same command as in Step 2:
    seq 0 2 | xargs -n 1 -I {} kubectl exec -n prelude rabbitmq-ha-{} -- bash -c "rabbitmqctl cluster_status"
  5. Validate that provisioning now proceeds by creating a new request in the Aria Automation portal
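
The checks in Steps 2 and 4 can be sketched as a small filter over the "rabbitmqctl cluster_status" output. This is a minimal sketch, not part of the original article: the helper name `running_nodes` is hypothetical, and it assumes the plain-text output format in which a "Running Nodes" heading is followed by one node name per line (each containing "@") until the next heading. Other RabbitMQ versions may format the output differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the KB article): reads the output of
# `rabbitmqctl cluster_status` on stdin and prints the node names listed
# under the "Running Nodes" heading, one per line.
# Assumption: node names contain "@" and the section ends at the next
# non-empty line that is not a node name.
running_nodes() {
  awk '
    /^Running Nodes/     { in_section = 1; next }   # section starts
    in_section && /@/    { print; next }            # a node name
    in_section && NF > 0 { in_section = 0 }         # next heading ends it
  '
}

# Usage on an appliance node with kubectl access (pod name as in Step 3):
#   kubectl exec -n prelude rabbitmq-ha-0 -- bash -c "rabbitmqctl cluster_status" | running_nodes
#   kubectl exec -n prelude rabbitmq-ha-0 -- bash -c "rabbitmqctl cluster_status" | running_nodes | wc -l
```

After Step 3, the count from the second usage line would be expected to be 2 while the third node is still down.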
