
vSphere Fault Tolerance (FT)

 

🔷 vSphere Fault Tolerance (FT) – Detailed Explanation

In VMware vSphere, Fault Tolerance (FT) provides continuous availability for virtual machines by eliminating downtime during host failures.

Unlike HA (which restarts VMs after failure), FT ensures zero downtime and zero data loss.


🔹 1️⃣ What is vSphere FT?

vSphere FT creates:

  • Primary VM (Active)

  • Secondary VM (Shadow copy)

Both run simultaneously on different ESXi hosts.

If the primary host fails:
✔ Secondary VM immediately becomes active
✔ No reboot
✔ No service interruption
✔ No transaction loss

This is called Continuous Availability.


🔹 2️⃣ How FT Works (Architecture)

Components Involved:

  • ESXi Hosts (minimum 2)

  • Shared Storage (SAN / vSAN / NFS)

  • vMotion network

  • FT Logging network (very important)

  • vCenter Server

Process Flow:

  1. VM powered ON

  2. FT enabled

  3. vSphere creates a secondary VM

  4. CPU & memory execution state is mirrored in real time

  5. The primary's inputs and state changes are streamed over the FT logging network and applied on the secondary

If Host A fails → Host B already has synchronized state → Takes over instantly.
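The flow above can be sketched as a toy model. This is only an illustration of the idea, not how ESXi implements it; the class names `FtVm` and `FtPair` and the host names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FtVm:
    """Toy stand-in for one half of an FT pair (hypothetical model)."""
    host: str
    role: str                      # "primary" or "secondary"
    state: dict = field(default_factory=dict)


class FtPair:
    def __init__(self, primary_host: str, secondary_host: str):
        self.primary = FtVm(primary_host, "primary")
        self.secondary = FtVm(secondary_host, "secondary")

    def write(self, key: str, value: int) -> None:
        # Every state change on the primary is mirrored to the secondary.
        self.primary.state[key] = value
        self.secondary.state[key] = value

    def fail_host(self, host: str, spare_host: str) -> None:
        if host == self.primary.host:
            # The secondary already holds the same state, so it is promoted.
            self.secondary.role = "primary"
            self.primary = self.secondary
        elif host != self.secondary.host:
            return  # an unrelated host failed: nothing to do
        # Protection is restored by creating a new secondary on a spare host.
        self.secondary = FtVm(spare_host, "secondary", dict(self.primary.state))


pair = FtPair("host-a", "host-b")
pair.write("balance", 100)
pair.fail_host("host-a", spare_host="host-c")
print(pair.primary.host, pair.primary.state)  # host-b {'balance': 100}
```

The point of the sketch is that promotion needs no restart, because the secondary's state is already identical when the failure happens.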


🔹 3️⃣ Key Technical Concept – Lockstep Technology

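FT was originally built on vLockstep technology (record and replay), used in vSphere 5.x and earlier:

  • Non-deterministic inputs and events on the primary are recorded

  • The secondary replays them so that it executes the same instruction stream

  • Memory state therefore stays synchronized

  • Network and disk I/O are preserved

Since vSphere 6.0, FT uses Fast Checkpointing instead, which continuously copies changed memory and device state to the secondary. This is what makes multi-vCPU protection possible.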

Result:
Zero transaction loss.
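The record-and-replay idea behind lockstep can be illustrated with a toy state machine. This is a sketch under simplifying assumptions (an integer as the "VM state", two made-up event kinds), not VMware's implementation: if both sides start from the same state and apply the same logged events deterministically, they end in the same state.

```python
def apply_event(state: int, event: tuple[str, int]) -> int:
    """Deterministically apply one logged event to a toy VM state."""
    kind, value = event
    if kind == "add":
        return state + value
    if kind == "mul":
        return state * value
    raise ValueError(f"unknown event kind: {kind}")


def run_primary(events):
    """Execute events on the primary and record them in a log."""
    state, log = 0, []
    for event in events:
        state = apply_event(state, event)
        log.append(event)          # this log is what travels to the secondary
    return state, log


def replay_on_secondary(log):
    """Replay the primary's log from the same starting state."""
    state = 0
    for event in log:
        state = apply_event(state, event)
    return state


primary_state, log = run_primary([("add", 5), ("mul", 3), ("add", -2)])
print(primary_state, replay_on_secondary(log))  # 13 13
```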


🔹 4️⃣ HA vs FT Comparison

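| Feature        | HA                          | FT               |
|----------------|-----------------------------|------------------|
| Downtime       | 2–5 minutes                 | 0 seconds        |
| VM Reboot      | Yes                         | No               |
| Data Loss      | Possible (few transactions) | None             |
| Resource Usage | Normal                      | Double (two VMs) |
| Use Case       | Most workloads              | Mission critical |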

🔹 5️⃣ FT Requirements

Infrastructure Requirements:

✔ Shared storage
✔ Compatible CPUs (EVC recommended)
✔ vMotion enabled
✔ Dedicated FT logging network
✔ FT logging bandwidth: minimum 1 Gbps (10 Gbps recommended)
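A pre-check for these requirements might look like the sketch below. The `HostConfig` fields are hypothetical (they are not a vSphere API), and the thresholds are simply the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class HostConfig:
    """Hypothetical summary of one ESXi host's FT-relevant settings."""
    name: str
    shared_storage: bool
    vmotion_enabled: bool
    ft_logging_nic_gbps: float   # 0 means no dedicated FT logging NIC


def ft_readiness_findings(host: HostConfig) -> list[str]:
    """Return findings for one host; an empty list means nothing to fix."""
    findings = []
    if not host.shared_storage:
        findings.append("ERROR: no shared storage")
    if not host.vmotion_enabled:
        findings.append("ERROR: vMotion not enabled")
    if host.ft_logging_nic_gbps <= 0:
        findings.append("ERROR: no dedicated FT logging network")
    elif host.ft_logging_nic_gbps < 1:
        findings.append("ERROR: FT logging link below 1 Gbps minimum")
    elif host.ft_logging_nic_gbps < 10:
        findings.append("WARN: FT logging link below recommended 10 Gbps")
    return findings


print(ft_readiness_findings(HostConfig("esx01", True, True, 1)))
```

CPU compatibility (EVC) is left out on purpose, since checking it needs real inventory data rather than a simple flag.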


🔹 6️⃣ FT Limitations (Enterprise Awareness)

  • Not suitable for heavy, CPU-intensive database workloads

  • Limited vCPU count per protected VM (the limit depends on vSphere version and license edition)

  • Requires double the compute resources

  • Snapshot restrictions

  • Limitations with some backup tools


🔹 7️⃣ Enterprise Use Cases

Ideal for:

  • Domain Controllers

  • DNS servers

  • License servers

  • Small but critical applications

  • Healthcare PACS controllers

  • Financial transaction gateways

Not ideal for:

  • Large Oracle DB

  • SAP HANA

  • Heavy analytics


🔹 8️⃣ FT States Explained

  • Protected

  • Not Protected

  • Secondary VM Running

  • Need Secondary

  • Disabled

Monitoring FT health is critical in production.
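As a monitoring sketch, the states above can be modeled as an enum and filtered for VMs that need attention. The state labels come from the list above; the grouping into "unprotected" states and the inventory dictionary are assumptions made for illustration, not output from a real vSphere API.

```python
from enum import Enum


class FtState(Enum):
    PROTECTED = "Protected"
    NOT_PROTECTED = "Not Protected"
    SECONDARY_RUNNING = "Secondary VM Running"
    NEED_SECONDARY = "Need Secondary"
    DISABLED = "Disabled"


# Assumed grouping: states in which no live secondary is covering the VM.
UNPROTECTED = {FtState.NOT_PROTECTED, FtState.NEED_SECONDARY, FtState.DISABLED}


def vms_needing_attention(inventory: dict[str, FtState]) -> list[str]:
    """Return the names of VMs whose FT state should raise an alert."""
    return sorted(name for name, state in inventory.items() if state in UNPROTECTED)


inventory = {
    "dc01": FtState.PROTECTED,
    "lic01": FtState.NEED_SECONDARY,
    "dns01": FtState.DISABLED,
}
print(vms_needing_attention(inventory))  # ['dns01', 'lic01']
```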


🔹 9️⃣ Failure Scenarios

Scenario 1: Host Failure

Secondary instantly becomes primary
New secondary created on another host

Scenario 2: Network Failure (FT Logging)

Protection lost but VM continues running
Alert generated

Scenario 3: Storage Failure

FT cannot protect against shared storage failure
Need Storage HA / vSAN redundancy


🔹 🔟 Performance Impact

FT consumes:

  • Extra CPU cycles

  • Memory overhead

  • Logging bandwidth

Best practice:
Enable FT only for Tier-0 workloads.
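The compute cost can be estimated with simple arithmetic. The sketch below counts each FT-enabled VM twice (primary plus secondary); it deliberately ignores hypervisor overhead and logging bandwidth, and the `Vm` record and VM names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Vm:
    name: str
    vcpus: int
    mem_gb: int
    ft_enabled: bool = False


def cluster_demand(vms: list[Vm]) -> tuple[int, int]:
    """Total (vCPUs, memory GB) the cluster must host, counting FT secondaries."""
    vcpus = mem_gb = 0
    for vm in vms:
        copies = 2 if vm.ft_enabled else 1   # FT runs a primary and a secondary
        vcpus += vm.vcpus * copies
        mem_gb += vm.mem_gb * copies
    return vcpus, mem_gb


vms = [Vm("dc01", 2, 8, ft_enabled=True), Vm("web01", 4, 16)]
print(cluster_demand(vms))  # (8, 32)
```

Running the numbers this way before enabling FT shows quickly why it is usually limited to a small set of critical VMs.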


🔹 1️⃣1️⃣ Advanced Enterprise Design Example

In a 12-host cluster:

  • 2 hosts reserved for FT-heavy workloads

  • Separate 10Gb FT logging network

  • Admission control adjusted

  • Monitor FT latency (<10ms recommended)


🔹 1️⃣2️⃣ Interview-Ready Explanation

“vSphere FT provides continuous availability by creating a live shadow VM that mirrors CPU and memory execution in real time. In case of host failure, the secondary VM immediately takes over without reboot, ensuring zero downtime and zero data loss.”


🔹 Final Summary

FT = Continuous Availability
HA = Restart-Based Recovery

FT is powerful but resource-intensive.
Use it only for business-critical, low-to-medium workload VMs.



What about the split-brain scenario in FT?

🔷 Split-Brain Scenario in vSphere FT

In VMware vSphere Fault Tolerance (FT), a split-brain situation occurs when both the Primary VM and Secondary VM believe they are the active instance due to a communication breakdown.

This is a rare but critical design scenario.


🔹 1️⃣ What is Split Brain?

In FT:

  • Primary VM runs on Host A

  • Secondary VM runs on Host B

  • Both are synchronized through the FT Logging network

If communication between hosts is lost (but both hosts are still running), each host may think the other has failed.

Result:
⚠ Risk of two active VMs running simultaneously.

This can cause:

  • Data corruption

  • Duplicate transactions

  • IP conflicts

  • Database inconsistencies


🔹 2️⃣ Why Split Brain Happens

Most common causes:

  1. FT logging network failure

  2. vMotion network isolation

  3. Host isolation from management network

  4. Network partition in cluster

  5. Misconfigured gateway / VLAN

Important:
Split brain is network-driven, not storage-driven.


🔹 3️⃣ How vSphere Prevents Split Brain

vSphere combines file locking on shared storage with HA's network and datastore heartbeats so that only one VM of the pair can remain active.

Protection mechanisms include:

✅ A. Datastore Locking

During a failover, both sides compete for an atomic file lock on shared storage.
Only the winner continues as the primary; the other VM stops, so only one VM remains active.
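The "only one winner" property of a datastore lock can be sketched with an atomic create-if-absent file operation. This is a local-filesystem analogy, not the actual VMFS or FT mechanism; the lock file name and the temporary directory standing in for the shared datastore are hypothetical.

```python
import os
import tempfile


def try_become_primary(lock_dir: str, vm_role: str) -> bool:
    """Attempt to win the tie-breaker; only the first caller can succeed."""
    lock_path = os.path.join(lock_dir, "ft-tiebreaker.lock")  # hypothetical name
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(vm_role)      # record who holds the lock
    return True


with tempfile.TemporaryDirectory() as shared_datastore:
    secondary_wins = try_become_primary(shared_datastore, "secondary")
    primary_wins = try_become_primary(shared_datastore, "primary")
    print(secondary_wins, primary_wins)  # True False
```

Because the loser learns immediately that it lost, it can stop itself instead of continuing as a second active instance.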


✅ B. Host Isolation Response

Configured via HA:

  • Leave powered on

  • Power off

  • Shut down

Correct configuration prevents dual active state.


✅ C. FT Role Re-election

If communication is lost:

  • The hosts detect that FT traffic has stopped

  • The shared-storage lock decides which VM is the authoritative primary

  • The other VM is stopped automatically


✅ D. FT Heartbeat Mechanism

Continuous heartbeat via:

  • Management network

  • FT logging network

  • Datastore heartbeat

Multiple checks reduce false split.
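The value of several heartbeat channels can be shown with a small classifier: the peer is presumed failed only when every channel has gone silent. The channel names follow the list above; the 5-second timeout and the function itself are assumptions for illustration, not vSphere defaults.

```python
def peer_status(last_seen: dict[str, float], now: float, timeout: float = 5.0) -> str:
    """Classify the peer from per-channel heartbeat timestamps (in seconds)."""
    if not last_seen:
        raise ValueError("at least one heartbeat channel is required")
    alive = [ch for ch, ts in last_seen.items() if now - ts <= timeout]
    if len(alive) == len(last_seen):
        return "healthy"
    if alive:
        return "degraded"     # some channel still sees the peer: do not fail over
    return "unreachable"      # every channel is silent: peer is presumed failed


heartbeats = {"management": 99.0, "ft_logging": 80.0, "datastore": 98.0}
print(peer_status(heartbeats, now=100.0))  # degraded
```

With a single channel, the FT logging outage in this example would look identical to a dead host, which is exactly the false split the extra checks are meant to avoid.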


🔹 4️⃣ Example Split Brain Scenario

Situation:

  • Host A and Host B running FT pair

  • FT logging network fails

  • Management network still up

  • Storage accessible to both

What happens:

  1. Secondary stops receiving execution stream

  2. It does NOT immediately become primary

  3. HA evaluates host isolation

  4. One VM is powered off to avoid conflict

VMware always prioritizes data integrity over availability.


🔹 5️⃣ Dangerous Scenario (Improper Network Design)

If:

  • FT logging network isolated

  • Management network isolated

  • Datastore heartbeat unavailable

Then:
Each host may assume the other failed.

Modern versions include additional safeguards, but poor design increases risk.


🔹 6️⃣ Enterprise Best Practices to Avoid Split Brain

✔ Dedicated FT logging network (10Gb recommended)
✔ Redundant physical NICs
✔ Separate VLAN for FT traffic
✔ Enable datastore heartbeat
✔ Configure proper isolation response
✔ Avoid single point of network failure
✔ Monitor FT logging latency
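Part of this checklist can be turned into a simple design review script. The sketch below assumes a hypothetical `FtNetworkDesign` record filled in by hand or from your own inventory export; it does not query vSphere.

```python
from dataclasses import dataclass


@dataclass
class FtNetworkDesign:
    """Hypothetical description of one host's FT-related networking."""
    ft_uplinks: list[str]         # physical NICs backing FT logging
    ft_vlan: int
    other_vlans: list[int]        # VLANs used by management, vMotion, VM traffic
    datastore_heartbeat: bool


def design_gaps(design: FtNetworkDesign) -> list[str]:
    """Return the checklist items this design does not satisfy."""
    gaps = []
    if len(set(design.ft_uplinks)) < 2:
        gaps.append("FT logging has no redundant physical NIC")
    if design.ft_vlan in design.other_vlans:
        gaps.append("FT logging shares a VLAN with other traffic")
    if not design.datastore_heartbeat:
        gaps.append("datastore heartbeat is disabled")
    return gaps


risky = FtNetworkDesign(["vmnic2"], ft_vlan=20, other_vlans=[10, 20], datastore_heartbeat=True)
print(design_gaps(risky))
```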


🔹 7️⃣ Interview-Ready Explanation

“In vSphere FT, split brain can occur if communication between primary and secondary VMs is lost, potentially causing both to assume active state. VMware prevents this using datastore locking, heartbeats, and HA isolation responses to ensure only one VM remains active.”


🔹 8️⃣ Important Clarification

FT split brain is very rare in properly designed clusters.

Modern vSphere versions heavily mitigate it using:

  • Multiple heartbeat channels

  • Automatic conflict resolution

  • Host election logic


🔹 Final Takeaway

Split Brain Risk = Network Misconfiguration
Protection = Heartbeats + Locking + HA Policies

Design correctly → Risk nearly eliminated.



