
 Troubleshooting Kubernetes disk pressure or disk latency in VMware Aria Automation and Automation Orchestrator 8.x

https://knowledge.broadcom.com/external/article?articleNumber=326110


Issue/Introduction

If VMware Aria Automation or Automation Orchestrator services are not running, but the node reports a Ready state while pods remain Pending, the steps in this document may resolve the issue and bring the cluster back online.

Symptoms:

Disk Pressure

  • Pods are all in Pending state.
  • Reviewing the kube-system pods shows multiple evicted pods being recreated and evicted again in a loop:
    kubectl get pods -n kube-system
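
To isolate the evicted pods from the rest of the list, the same command can be piped through grep. This is a minimal filtering sketch; the status text may vary slightly between Kubernetes versions:
    kubectl get pods -n kube-system | grep -i evicted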

Disk Latency

  • journalctl contains errors for etcd similar to those below (a filtered journal command follows this list):
    ...server is likely overloaded
    ...failed to send out heartbeat on time (exceeded the 100ms timeout for 17.346965944s, to 9a4c6c6012cbdb5a)
  • The kubelet and kube-api-server services restart randomly.
  • /opt/scripts/deploy.sh may fail.
  • The kubectl -n prelude get pods and kubectl get nodes commands randomly return the following error:
      • Unable to connect to the server: dial tcp: lookup vra-k8s.local on <DNS-IP>:53: no such host
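
To gauge how often these etcd warnings occur, the journal can be filtered for them. This is a minimal sketch; the exact message text can vary between versions:
    journalctl --since "2 hours ago" | grep -iE "likely overloaded|failed to send out heartbeat"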


Environment

Aria Automation 8.x
Aria Automation Orchestrator 8.x
VMware vRealize Automation 8.x (vRA)
VMware vRealize Orchestrator 8.x (vRO)

Cause

Disk Pressure

Disk or memory pressure exists on one of the appliances in the cluster, which causes kube-system pods to be evicted. This places the prelude pods into a Pending state and leaves the system non-functional.

To confirm if this is the case, review the journal with the following command:

journalctl -u kubelet

If the journal is very large, you can pipe it to grep and look for entries relating to disk pressure or memory pressure. If further filtering is needed, add another grep and search by date in the "Mon DD" format (e.g. "Mar 10"):

journalctl -u kubelet | grep -i pressure
journalctl -u kubelet | grep -i pressure | grep "Mar 10"
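
Node conditions also report pressure directly. The following is a minimal sketch that surfaces the DiskPressure and MemoryPressure condition lines from kubectl describe:

kubectl describe nodes | grep -iE "diskpressure|memorypressure"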

Disk Latency

Per the System Requirements in the official product documentation (techdocs.broadcom.com), the maximum storage latency is 20 ms for each disk I/O operation from any Aria Automation node.

Resolution

Disk Pressure

  1. Confirm that disk pressure is the issue with the steps in the "Cause" section of this document.
  2. Verify by running vracli disk-mgr and df -i to check disk space and inode availability.
    1. If disk usage on the primary disk (generally /dev/sda4) or inode utilization on the disk is above 80%, increase the size of the disk in vCenter, then run the following at the terminal to expand the disk (a df-based check is sketched after this list):
      vracli disk-mgr resize
  3. Monitor kube-system with watch kubectl get pods -n kube-system and verify that the evictions stop and pods return to a running state. This may take several minutes.
  4. Monitor your prelude pods with watch kubectl get pods -n prelude to confirm the prelude pods are starting. This will also take several minutes.
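
The following is a minimal sketch for the disk and inode check in step 2, assuming the df on the appliance is GNU df and supports --output; it prints any filesystem above 80% usage:

df --output=target,pcent | awk 'NR>1 && $2+0 > 80'
df --output=target,ipcent | awk 'NR>1 && $2+0 > 80'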

Disk Latency

  1. Move the VMware Aria Automation or Automation Orchestrator appliances to Storage that can meet the Maximum Storage Latency requirements as defined by the official product documentation.


Workaround:

Pods stuck in Pending state

It is possible the pods will stay in a Pending state and not restart on their own. If this occurs, there are a few situations that may cause it:
  1. If the disk was completely full on one of the nodes, the Docker images may have become corrupted or otherwise encountered an issue.
  2. There are problems in the kube-system pods.
If, after waiting 5-10 minutes, no prelude pods have moved from Pending to Running, do the following:
  1. Check the fluentd service via systemctl status fluentd and confirm it is healthy. On version 8.1 or older, use service fluentd status instead. Restart the service if needed via systemctl restart fluentd (8.2 and newer) or service fluentd restart (8.1 and older).
  2. If the service does not restart properly, run /opt/scripts/restore-docker-images.sh on all nodes.
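
A condensed version of the commands above, assuming an 8.2 or newer appliance (substitute the service form on 8.1 and older):

systemctl status fluentd
systemctl restart fluentd
# Only if fluentd will not stay healthy, rebuild the images on all nodes:
/opt/scripts/restore-docker-images.sh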
Once you've confirmed fluentd is in a healthy/running state, check for kube-system pods not starting:
  1. Run
    kubectl get pods -n kube-system
  2. Check for any pods that are not in a Running or Completed state (e.g. "ContainerCreating" or "Error"); see the sketch after this list.
  3. Run the command below for each such pod so it is deleted and rebuilt, and wait for this process to complete:
    kubectl delete pods -n kube-system podName
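
A minimal sketch for step 2 that lists only the pods whose phase is neither Running nor Succeeded (note that a pod in CrashLoopBackOff still reports a Running phase, so review the full list as well):

kubectl get pods -n kube-system --field-selector=status.phase!=Running,status.phase!=Succeeded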
Once all kube-system pods are in a healthy state, you can monitor with kubectl get pods -n prelude --watch again to see if the pods begin changing to a Running state. If the system still does not recover after several minutes, follow the procedure below:

Procedure to Restart Services

  1. Run
    /opt/scripts/deploy.sh --shutdown
  2. Monitor the pods in a separate terminal window to confirm they tear down successfully (see the monitoring sketch after this procedure).
  3. Run
    /opt/scripts/deploy.sh
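
For step 2, the same watch commands used earlier in this article can confirm the teardown before redeploying; a minimal monitoring sketch:

watch kubectl get pods -n prelude
watch kubectl get pods -n kube-system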


Additional Information

Impact/Risks:
VMware Aria Automation or Automation Orchestrator services become inaccessible from the web interface and you can no longer log in.
