
Network isolation causes split-brain scenario in a 3 node cluster: Resetting vPostgres clustering


Issue/Introduction

This article provides instructions on how to monitor and restore a 3 node vPostgres cluster running within Kubernetes containers.

Symptoms:
  • The command vracli status shows multiple primary database nodes.
  • vPostgres is unable to elect a single master node.
  • The prelude-noop-intnet-netcheck.log files within the pods/kube-system/prelude-noop-intnet-ds-***** directories contain entries similar to the following (see the search sketch after this list):
2019/12/31 08:27:04 Failed ping for 10.244.2.2, packet loss is 100.000000
2019/12/31 08:27:04 Failed ping for 10.244.1.5, packet loss is 100.000000
2019/12/31 08:27:04 Pinging the majority of nodes failed.
  • 3 node vRealize Automation 8.0 / 8.0.1 cluster does not have redundant network pathing
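
The failed-ping entries above can be located quickly with a standard grep across the collected netcheck logs. This is a minimal sketch and assumes the pods/kube-system/prelude-noop-intnet-ds-* directories shown above are available locally (for example, from a collected log bundle); the directory suffix varies per node.
    grep -H -e "Failed ping" -e "Pinging the majority of nodes failed" pods/kube-system/prelude-noop-intnet-ds-*/prelude-noop-intnet-netcheck.log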


Cause

3 node vPostgres clustering can break down due to network isolation or connectivity loss, creating a split-brain scenario in which all 3 databases run as masters.

Resolution

Resiliency improvements will be introduced in vRealize Automation 8.1 to prevent this scenario from occurring.

 


Workaround:
Ensure that valid snapshots have been taken prior to performing any actions. Do not create live snapshots. For vRealize Automation 8.0, ensure cold (powered down) snapshots are taken as described in vRealize Automation 8.x Preparations for Backing Up.

It is highly encouraged to have a stringent backup procedure in place on a daily schedule.

It is recommended to have redundant network pathing between the ESXi hosts which host the vRealize Automation appliance nodes.


  1. On any one of the vRealize Automation virtual appliances, run the following command once
    vracli cluster exec -- touch /data/db/live/debug
Note: This creates a flag file on all cluster nodes that pauses the database pods when they start, so they can then be worked with manually.
  2. Restart the postgres-1 and postgres-2 pods
    kubectl delete pod -n prelude postgres-1; kubectl delete pod -n prelude postgres-2;
Note: This restarts the postgres-1 and postgres-2 pods. Due to the debug flag, they will stop and wait instead of starting vPostgres.
  3. Identify the node(s) on which the postgres-1 and postgres-2 pods are now running
    kubectl get pods -n prelude -l name=postgres -o wide
  4. On the node where postgres-1 is running, execute the following command to remove the debug flag
    rm /data/db/live/debug
  5. Run
    kubectl -n prelude logs -f postgres-1
  6. Monitor the logs and ensure that postgres-1 discovers postgres-0 as the primary, re-syncs from it, and starts working. A message similar to the following is reported in the postgres-1 log if successful:
    '[repmgrd] monitoring primary node "postgres-0.postgres.prelude.svc.cluster.local" (ID: 100) in normal state'
  7. Repeat the same steps for postgres-2 (a command sketch for postgres-2 follows this list)
  8. Finally, remove the /data/db/live/debug file on the node where postgres-0 is running (a verification sketch follows this list)
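
To repeat the procedure for postgres-2 (step 7), the commands mirror those used for postgres-1. This is a minimal sketch and assumes postgres-2 is the remaining paused pod; run the rm command on the node hosting postgres-2 as identified in step 3, then follow its logs and watch for the same repmgrd message.
    rm /data/db/live/debug
    kubectl -n prelude logs -f postgres-2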
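
Once the /data/db/live/debug file has been removed from the node hosting postgres-0 (step 8), cluster health can be re-checked with the commands already referenced in this article. This is a minimal sketch; a recovered cluster should show all three postgres pods running and vracli status reporting a single primary database node rather than the multiple primaries described in the symptoms.
    kubectl get pods -n prelude -l name=postgres -o wide
    vracli status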
