How to remove and rejoin a faulty node in Aria Automation 8.x Cluster
https://knowledge.broadcom.com/external/article?articleNumber=345933
Products
VMware Aria Suite
Issue/Introduction
This article describes how to remove a faulty node from an Aria Automation 8.x cluster and rejoin it.
Environment
VMware Aria Automation 8.x
Resolution
If a node has been determined to be faulty and must be removed from the cluster and rejoined, follow these steps.
- In vCenter, take backup snapshots (without memory) of every appliance in the VMware Aria Automation HA configuration.
- From a root command line on any healthy node, run the following command to identify the node that hosts the primary database pod:
kubectl get pod `vracli status | jq -r '.databaseNodes[] | select(.["Role"] == "primary") | .["Node name"]' | cut -d '.' -f 1` -n prelude -o wide --no-headers=true
Example output:
postgres-0 1/1 Running 0 39h ##.###.#.## healthy_node-fqdn-xxx-xx.company.com <none> <none>
Important: The primary database node must be one of the healthy nodes. If the primary database node is faulty, contact technical support instead of proceeding.
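The check above can be scripted. The following is a minimal sketch, not part of the documented procedure: `primary_db_location` is a hypothetical helper name, and it assumes the NODE column of the `-o wide` output contains the node FQDN, as in the example output.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether the primary database pod runs on the faulty node.
# Reads one line of `kubectl get pod ... -o wide --no-headers=true` on stdin.
# primary_db_location and its argument are hypothetical names for illustration.

primary_db_location() {
  local faulty_fqdn="$1"
  local node
  # With -o wide, the last three columns are NODE, NOMINATED NODE and
  # READINESS GATES, so NODE is the third field counted from the end.
  node="$(awk 'NF >= 3 { print $(NF-2); exit }')"
  if [ -z "$node" ]; then
    echo "unknown: no pod line received"
    return 2
  fi
  if [ "$node" = "$faulty_fqdn" ]; then
    echo "STOP: primary database is on faulty node $node"
    return 1
  fi
  echo "OK: primary database is on $node"
}
```

To use it, pipe the output of the kubectl command above into `primary_db_location faulty-node-FQDN`. A non-zero return code means the procedure should not continue.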
- From the root command line of the healthy node, remove the faulty node.
vracli cluster remove faulty-node-FQDN
- From the faulty node, join the Aria Automation cluster.
vracli cluster join primary-DB-node-FQDN
- Log in as root to the command line of the primary database node.
- Deploy services on the cluster by running the following script.
/opt/scripts/deploy.sh
- Verify that the node has joined the cluster and is in the "Ready" state by running the following command:
kubectl get nodes
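For the verification step, a small filter can list any node that is not yet Ready. This is a sketch under the assumption that `kubectl get nodes` prints its default table (a header row, then NAME in the first column and STATUS in the second); `not_ready_nodes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: print the names of nodes whose STATUS is not exactly "Ready".
# Reads the default table output of `kubectl get nodes` on stdin.
# not_ready_nodes is a hypothetical helper name for illustration.

not_ready_nodes() {
  # NR > 1 skips the header row; $1 is NAME and $2 is STATUS.
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Running `kubectl get nodes | not_ready_nodes` prints nothing when every node reports Ready. A combined status such as "Ready,SchedulingDisabled" is also listed, because the comparison is exact.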
Additional Information
If the faulty node still has a damaged etcd database or other damaged Kubernetes elements after being removed from the cluster, you can reset the Kubernetes system by running this command on the faulty node:
vracli cluster leave
This can allow the faulty node to join the cluster in cases where the vracli cluster join command above hangs indefinitely (no output after 10-15 minutes).
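One way to detect such a hang without watching the terminal is to run the join under a time limit. The sketch below assumes the GNU coreutils `timeout` utility is available on the appliance, which this article does not confirm; `run_with_deadline` is a hypothetical wrapper name. The article also does not state whether interrupting a running join is safe, so treat this as an illustration rather than a supported procedure.

```shell
#!/usr/bin/env bash
# Sketch only: run a command under a time limit and report when it times out.
# Assumes GNU coreutils `timeout`, which exits with status 124 on timeout.
# run_with_deadline is a hypothetical wrapper name for illustration.

run_with_deadline() {
  local limit="$1"
  shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "Timed out after $limit: $*" >&2
  fi
  return "$rc"
}
```

For example, `run_with_deadline 15m vracli cluster join primary-DB-node-FQDN` returns 124 if the join produces no result within 15 minutes, at which point running vracli cluster leave on the faulty node and retrying the join is the path described above.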