Network isolation causes split-brain scenario in a 3 node cluster: Resetting vPostgres clustering
Issue/Introduction
This article provides instructions on how to monitor and restore a 3-node vPostgres cluster running within Kubernetes pods.
Symptoms:
- The command vracli status shows multiple primary database nodes.
- vPostgres is unable to elect a single master node.
- prelude-noop-intnet-netcheck.log files within the pods/kube-system/prelude-noop-intnet-ds-***** directories contain entries similar to the following:
2019/12/31 08:27:04 Failed ping for 10.244.2.2, packet loss is 100.000000
2019/12/31 08:27:04 Failed ping for 10.244.1.5, packet loss is 100.000000
2019/12/31 08:27:04 Pinging the majority of nodes failed.
2019/12/31 08:27:04 Failed ping for 10.244.1.5, packet loss is 100.000000
2019/12/31 08:27:04 Pinging the majority of nodes failed.
- The 3-node vRealize Automation 8.0 / 8.0.1 cluster does not have redundant network pathing
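Note: On a live system, the same netcheck output may also be viewable directly from the DaemonSet pods, assuming the netcheck results are written to the container log. The pod name below is a placeholder; list the actual pod names first:
kubectl get pods -n kube-system | grep prelude-noop-intnet-ds
kubectl -n kube-system logs prelude-noop-intnet-ds-***** | tail -n 20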
Cause
A 3-node vPostgres cluster can break down when network isolation or connectivity loss creates a split-brain scenario in which all 3 nodes run as primary (master) databases.
Resolution
Resiliency improvements will be introduced in vRealize Automation 8.1 to prevent this scenario from occurring.
Workaround:
Ensure that valid snapshots have been taken prior to performing any actions. Do not create live snapshots. For vRealize Automation 8.0, take cold (powered-down) snapshots per: vRealize Automation 8.x Preparations for Backing Up
A stringent backup procedure on a daily schedule is strongly encouraged.
Redundant network pathing is recommended between the ESXi hosts that host the vRealize Automation appliance nodes.
To reset vPostgres clustering:
- On any one of the vRealize Automation virtual appliances, run the following command once:
vracli cluster exec -- touch /data/db/live/debug
Note: This creates a flag file on all cluster nodes that pauses the database pods when they start, so that they can be worked with manually.
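To confirm the flag file now exists on every node, the same cluster exec mechanism can be reused; this assumes vracli cluster exec passes arbitrary commands through, as the touch invocation above suggests:
vracli cluster exec -- ls -l /data/db/live/debug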
- Restart the postgres-1 and postgres-2 pods
kubectl delete pod -n prelude postgres-1; kubectl delete pod -n prelude postgres-2;
Note: This restarts the postgres-1 and postgres-2 pods. Due to the debug flag file, they will pause and wait instead of starting vPostgres.
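Optionally, the pods can be watched while they are rescheduled using the standard kubectl watch flag and the same label selector as the next step:
kubectl get pods -n prelude -l name=postgres -w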
- Identify the nodes on which the postgres-1 and postgres-2 pods are now running:
kubectl get pods -n prelude -l name=postgres -o wide
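Optionally, if only the node name for a single pod is needed, a standard kubectl jsonpath query returns it directly, for example:
kubectl get pod postgres-1 -n prelude -o jsonpath='{.spec.nodeName}'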
- On the node where postgres-1 is running, execute the following command to remove the debug flag file:
rm /data/db/live/debug
- Run
kubectl -n prelude logs -f postgres-1
- Monitor the logs and ensure that postgres-1 discovers postgres-0 as the primary, re-syncs from it, and starts working. A message similar to the following is reported in the postgres-1 log if successful:
[repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state
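If following the log interactively is not practical, the presence of the repmgrd message can also be checked after the fact using standard kubectl and grep:
kubectl -n prelude logs postgres-1 | grep repmgrd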
- Repeat the same for postgres-2
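For reference, the equivalent commands are shown below; run the rm command on the node hosting postgres-2, as identified in the earlier lookup, then monitor its log:
rm /data/db/live/debug
kubectl -n prelude logs -f postgres-2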
- Finally, remove the /data/db/live/debug file on the node where postgres-0 is running
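For reference, the node hosting postgres-0 can be identified with the same lookup used earlier, after which the flag file is removed on that node:
kubectl get pods -n prelude -l name=postgres -o wide
rm /data/db/live/debug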