Troubleshooting Kubernetes disk pressure or disk latency in VMware Aria Automation and Automation Orchestrator 8.x
https://knowledge.broadcom.com/external/article?articleNumber=326110
Issue/Introduction
Symptoms:
Disk Pressure
- Pods are all in Pending state.
- Review of the kube-system pods shows multiple evicted pods being recreated and evicted again in a loop:
kubectl get pods -n kube-system
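To list only the evicted pods, a filtered variant of the command above can be used (a minimal sketch using grep):
kubectl get pods -n kube-system | grep -i evicted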
Disk Latency
- journalctl contains errors for etcd similar to the below:
...server is likely overloaded ...failed to send out heartbeat on time (exceeded the 100ms timeout for 17.346965944s, to 9a4c6c6012cbdb5a)
- The kubelet and kube-apiserver services restart randomly.
- /opt/scripts/deploy.sh may fail.
- The kubectl -n prelude get pods and kubectl get nodes commands randomly return the following error:
Unable to connect to the server: dial tcp: lookup vra-k8s.local on <DNS-IP>:53: no such host
Environment
Aria Automation 8.x
Aria Automation Orchestrator 8.x
VMware vRealize Automation 8.x (vRA)
VMware vRealize Orchestrator 8.x (vRO)
Cause
Disk Pressure
Disk or memory pressure on one of the appliances in the cluster causes the kube-system pods to be evicted. This places the prelude pods into a Pending state and leaves the system non-functional.
To confirm if this is the case, review the journal with the following command:
journalctl -u kubelet
If the journal is very large, pipe the output to grep and look for entries relating to disk or memory pressure. If further log review is needed, add another grep and filter by date in the format "Mon DD" (e.g. "Mar 10"):
journalctl -u kubelet | grep -i pressure
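For example, to limit the results to a single day (the date below is only a placeholder):
journalctl -u kubelet | grep -i pressure | grep -i "Mar 10"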
Disk Latency
The official product documentation System Requirements (techdocs.broadcom.com) specifies a maximum storage latency of 20 ms for each disk I/O operation from any Aria Automation node.
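A quick way to check whether etcd is reporting latency problems is to search the journal for the heartbeat warnings shown in the Symptoms section (the patterns below are taken from that example output):
journalctl | grep -iE "failed to send out heartbeat|server is likely overloaded"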
Resolution
Disk Pressure
- Confirm that disk pressure is the issue with the steps in the "Cause" section of this document.
- Verify by running vracli disk-mgr and df -i to check disk space and inode availability.
- If disk usage on the primary disk (generally /dev/sda4) or inode utilization on the disk is above 80%, increase the size of the disk in vCenter, then run the following at the terminal to expand the disk (see the consolidated example after this list):
vracli disk-mgr resize
- Monitor kube-system with watch kubectl get pods -n kube-system and verify that the evictions stop and pods return to a running state. This may take several minutes.
- Monitor your prelude pods with watch kubectl get pods -n prelude to confirm the prelude pods are starting. This will also take several minutes.
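As a consolidated sketch of the sequence above (device names vary; confirm the actual disk with vracli disk-mgr):
# On each appliance node, check disk and inode usage
vracli disk-mgr
df -h
df -i
# After growing the disk in vCenter, expand it on the appliance
vracli disk-mgr resize
# Watch the kube-system and prelude pods recover
watch kubectl get pods -n kube-system
watch kubectl get pods -n prelude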
Disk Latency
- Move the VMware Aria Automation or Automation Orchestrator appliances to storage that can meet the maximum storage latency requirement as defined by the official product documentation.
Workaround:
Pods stuck in Pending state
It is possible that the pods will stay in Pending and not restart on their own. If this occurs, there are a few situations that may cause it:
- If the disk was completely full on one of the nodes, the docker images may have become corrupted or otherwise encountered an issue.
- There are problems in the kube-system pods.
- Check the fluentd service via "systemctl status fluentd" and verify that it is healthy. If the VRA version is 8.1 or older, the command is likely "service fluentd status" instead. Restart the service if needed via "systemctl restart fluentd" (VRA 8.2+) or "service fluentd restart" (VRA 8.1 and below).
- If the service does not restart properly, run "/opt/scripts/restore-docker-images.sh" on all VRA nodes.
- Run
kubectl get pods -n kube-system
- Check for any pods in a state other than Running or Completed (e.g. "ContainerCreating" or "Error").
- Run the command below against each such pod to rebuild it, and wait for the process to complete (see the example sequence after this list):
kubectl delete pods -n kube-system <podName>
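A sketch of the recovery sequence described above, assuming an appliance running VRA 8.2 or later (adjust the fluentd commands for 8.1 and earlier):
# Check and, if necessary, restart fluentd
systemctl status fluentd
systemctl restart fluentd
# If the service still fails, restore the docker images on every node
/opt/scripts/restore-docker-images.sh
# Identify and delete any kube-system pods stuck in a bad state
kubectl get pods -n kube-system
kubectl delete pods -n kube-system <podName>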
Procedure to Restart Services
- Run
/opt/scripts/deploy.sh --shutdown
- Monitor pods in a separate terminal window to confirm they tear down successfully (see the example sequence below).
- Run
/opt/scripts/deploy.sh
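The full restart sequence, with the monitoring step shown explicitly (run the watch command in a second terminal):
# Shut down the services
/opt/scripts/deploy.sh --shutdown
# In a separate terminal, confirm the pods tear down
watch kubectl get pods -n prelude
# Once the pods are gone, redeploy the services
/opt/scripts/deploy.sh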
Additional Information
VMware Aria Automation or Automation Orchestrator services become inaccessible from the web interface and you can no longer log in.