🔷 vSphere Fault Tolerance (FT) – Detailed Explanation
In VMware vSphere, Fault Tolerance (FT) provides continuous availability for virtual machines by eliminating downtime during host failures.
Unlike HA (which restarts VMs after failure), FT ensures zero downtime and zero data loss.
🔹 1️⃣ What is vSphere FT?
vSphere FT creates:
- Primary VM (Active)
- Secondary VM (Shadow copy)
Both run simultaneously on different ESXi hosts.
If the primary host fails:
✔ Secondary VM immediately becomes active
✔ No reboot
✔ No service interruption
✔ No transaction loss
This is called Continuous Availability.
🔹 2️⃣ How FT Works (Architecture)
Components Involved:
- ESXi Hosts (minimum 2)
- Shared Storage (SAN / vSAN / NFS)
- vMotion network
- FT Logging network (very important)
- vCenter Server
Process Flow:
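- VM is powered on
- FT is turned on for the VM
- vSphere creates a secondary VM on a different host
- CPU and memory execution state is mirrored to the secondary in real time over the FT logging network
- Legacy FT logged and replayed instructions on the secondary; FT in vSphere 6.0 and later sends frequent checkpoints of changed state instead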
If Host A fails → Host B already has synchronized state → Takes over instantly.
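The takeover described above can be pictured as a small state machine. The sketch below is a toy model for illustration only, not VMware code: the `FtPair` class, its fields, and the host names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FtPair:
    """Toy model of one FT-protected VM (hypothetical names throughout)."""
    primary_host: str
    secondary_host: Optional[str]
    spare_hosts: List[str] = field(default_factory=list)
    guest_reboots: int = 0  # never incremented: FT failover does not reboot the guest

    @property
    def protected(self) -> bool:
        # The VM is protected only while a secondary exists.
        return self.secondary_host is not None

    def host_failed(self, host: str) -> None:
        if host in self.spare_hosts:
            # Losing an unused spare does not affect the running pair.
            self.spare_hosts.remove(host)
            return
        if host == self.primary_host:
            if self.secondary_host is None:
                raise RuntimeError("primary lost while unprotected")
            # The secondary already holds synchronized state, so it is promoted.
            self.primary_host = self.secondary_host
            self.secondary_host = None
        elif host == self.secondary_host:
            self.secondary_host = None
        else:
            return
        # Re-protect by placing a new secondary on a spare host, if one exists.
        if self.spare_hosts:
            self.secondary_host = self.spare_hosts.pop(0)
```

With hosts `esx-a` (primary), `esx-b` (secondary) and `esx-c` (spare), a failure of `esx-a` leaves the VM running on `esx-b` with a new secondary on `esx-c`.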
🔹 3️⃣ Key Technical Concept – Lockstep Technology
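Legacy FT (releases before vSphere 6.0, single-vCPU VMs only) used vLockstep technology:
- The primary records its non-deterministic inputs and events
- The secondary replays them so that it executes the same instruction stream
- Memory state therefore stays identical on both hosts
- Network and disk I/O preserved
FT in vSphere 6.0 and later replaces record/replay with Fast Checkpointing: changed memory and device state is sent to the secondary as a continuous series of checkpoints over the FT logging network, which is what allows multi-vCPU VMs to be protected.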
Result:
Zero transaction loss.
🔹 4️⃣ HA vs FT Comparison
| Feature | HA | FT |
|---|---|---|
| Downtime | Typically a few minutes (VM restart plus guest OS boot) | None for host failures |
| VM Reboot | Yes | No |
| Data Loss | Possible (in-flight transactions) | None |
| Resource Usage | Normal | Double (two VMs) |
| Use Case | Most workloads | Mission critical |
🔹 5️⃣ FT Requirements
Infrastructure Requirements:
✔ Shared storage
✔ Compatible CPUs (EVC recommended)
✔ vSphere HA enabled on the cluster
✔ vMotion enabled
✔ Dedicated FT logging network
✔ Minimum 1Gbps for FT logging (10Gbps recommended)
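A checklist like this is easy to turn into a pre-flight check. The following is a minimal sketch that assumes the cluster facts have already been collected by some other means; the `ClusterFacts` fields and the function name are hypothetical and do not correspond to any vSphere API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClusterFacts:
    """Hand-collected facts about a cluster (hypothetical structure)."""
    host_count: int
    shared_storage: bool
    vmotion_enabled: bool
    dedicated_ft_logging: bool
    ft_logging_gbps: float


def ft_readiness(c: ClusterFacts) -> Tuple[List[str], List[str]]:
    """Return (blockers, warnings) based on the checklist above."""
    blockers: List[str] = []
    warnings: List[str] = []
    if c.host_count < 2:
        blockers.append("at least 2 ESXi hosts are required")
    if not c.shared_storage:
        blockers.append("shared storage is required")
    if not c.vmotion_enabled:
        blockers.append("vMotion must be enabled")
    if not c.dedicated_ft_logging:
        blockers.append("a dedicated FT logging network is required")
    if c.ft_logging_gbps < 1:
        blockers.append("FT logging link is below the 1Gbps minimum")
    elif c.ft_logging_gbps < 10:
        warnings.append("FT logging link is below the recommended 10Gbps")
    return blockers, warnings
```

Separating blockers from warnings keeps the hard minimums distinct from the recommendations in the list.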
🔹 6️⃣ FT Limitations (Enterprise Awareness)
- Not suitable for heavy DB workloads (high CPU)
- Limited vCPU support (the maximum depends on vSphere version and license edition)
- Requires double compute resources
- Snapshot restrictions
- Some backup tools have limitations
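The "double compute resources" point is worth quantifying during capacity planning. The sketch below is a rough estimate only: it simply doubles configured vCPUs and memory and ignores hypervisor overhead, and the function name is hypothetical.

```python
from typing import List, Tuple


def ft_footprint(vms: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Total (vCPUs, memory GB) consumed by FT-protected VMs.

    `vms` holds one (vcpus, memory_gb) tuple per protected VM. Each VM is
    counted twice because the secondary needs the same resources as the primary.
    """
    total_vcpus = 2 * sum(vcpus for vcpus, _ in vms)
    total_mem_gb = 2 * sum(mem for _, mem in vms)
    return total_vcpus, total_mem_gb
```

For example, protecting a 2 vCPU / 8 GB VM and a 4 vCPU / 16 GB VM consumes 12 vCPUs and 48 GB across the cluster.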
🔹 7️⃣ Enterprise Use Cases
Ideal for:
- Domain Controllers
- DNS servers
- License servers
- Small but critical applications
- Healthcare PACS controllers
- Financial transaction gateways
Not ideal for:
- Large Oracle DB
- SAP HANA
- Heavy analytics
🔹 8️⃣ FT States Explained
- Protected
- Not Protected
- Secondary VM Running
- Need Secondary
- Disabled
Monitoring FT health is critical in production.
🔹 9️⃣ Failure Scenarios
Scenario 1: Host Failure
Secondary instantly becomes primary
New secondary created on another host
Scenario 2: Network Failure (FT Logging)
Protection lost but VM continues running
Alert generated
Scenario 3: Storage Failure
FT cannot protect against shared storage failure
Need Storage HA / vSAN redundancy
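For runbooks, the three scenarios above can be captured as a simple lookup. This is only a restatement of the text in code form; the enum and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    HOST = "host"
    FT_LOGGING_NETWORK = "ft_logging_network"
    SHARED_STORAGE = "shared_storage"


@dataclass(frozen=True)
class Outcome:
    vm_keeps_running: bool   # does the workload stay up?
    stays_protected: bool    # is FT protection kept or restored automatically?
    follow_up: str           # what happens or is needed next


OUTCOMES = {
    Failure.HOST: Outcome(True, True, "new secondary created on another host"),
    Failure.FT_LOGGING_NETWORK: Outcome(True, False, "alert generated; restore the logging network"),
    Failure.SHARED_STORAGE: Outcome(False, False, "needs storage HA or vSAN redundancy"),
}


def expected_outcome(failure: Failure) -> Outcome:
    """Look up what the text above says should happen for a failure type."""
    return OUTCOMES[failure]
```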
🔹 🔟 Performance Impact
FT consumes:
- Extra CPU cycles
- Memory overhead
- Logging bandwidth
Best practice:
Enable FT only for Tier-0 workloads.
🔹 1️⃣1️⃣ Advanced Enterprise Design Example
In a 12-host cluster:
- 2 hosts reserved for FT heavy workloads
- Separate 10Gb FT logging network
- Admission control adjusted
- Monitor FT latency (<10ms recommended)
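The latency guideline in this design can be checked against collected samples. The sketch below assumes the latency values (in milliseconds) were gathered by whatever monitoring tool is in use; the function name and report keys are hypothetical, and the 10 ms limit is taken from the list above.

```python
from statistics import mean
from typing import Dict, List

FT_LATENCY_LIMIT_MS = 10.0  # guideline from the design example above


def latency_report(samples_ms: List[float],
                   limit_ms: float = FT_LATENCY_LIMIT_MS) -> Dict[str, object]:
    """Summarize FT logging latency samples against the limit."""
    if not samples_ms:
        raise ValueError("no latency samples provided")
    worst = max(samples_ms)
    return {
        "mean_ms": mean(samples_ms),
        "max_ms": worst,
        # Samples at or above the limit count as violations.
        "violations": sum(1 for s in samples_ms if s >= limit_ms),
        "ok": worst < limit_ms,
    }
```

Reporting the maximum alongside the mean matters because short spikes can be hidden by a healthy-looking average.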
🔹 1️⃣2️⃣ Interview-Ready Explanation
“vSphere FT provides continuous availability by creating a live shadow VM that mirrors CPU and memory execution in real time. In case of host failure, the secondary VM immediately takes over without reboot, ensuring zero downtime and zero data loss.”
🔹 Final Summary
FT = Continuous Availability
HA = Restart-Based Recovery
FT is powerful but resource-intensive.
Use it only for business-critical, low-to-medium workload VMs.
🔷 Split-Brain Scenario in vSphere FT
In VMware vSphere Fault Tolerance (FT), a split-brain situation occurs when both the Primary VM and Secondary VM believe they are the active instance due to a communication breakdown.
This is a rare but critical design scenario.
🔹 1️⃣ What is Split Brain?
In FT:
- Primary VM runs on Host A
- Secondary VM runs on Host B
- Both are synchronized through the FT Logging network
If communication between hosts is lost (but both hosts are still running), each host may think the other has failed.
Result:
⚠ Risk of two active VMs running simultaneously.
This can cause:
- Data corruption
- Duplicate transactions
- IP conflicts
- Database inconsistencies
🔹 2️⃣ Why Split Brain Happens
Most common causes:
- FT logging network failure
- vMotion network isolation
- Host isolation from management network
- Network partition in cluster
- Misconfigured gateway / VLAN
Important:
Split brain is network-driven, not storage-driven.
🔹 3️⃣ How vSphere Prevents Split Brain
vSphere FT coordinates failover through atomic file locking on shared storage, so that only one side can continue as the Primary VM.
Protection mechanisms include:
✅ A. Datastore Locking
When the two VMs lose contact, each tries to take a lock on a file kept on shared storage.
The lock is atomic, so only one VM can win and remain active; the other stops.
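The idea of letting shared storage arbitrate can be illustrated with an ordinary exclusive file create. This is an analogy only, assuming a normal filesystem; it is not how VMware implements its on-disk locks, and the function name and lock path are hypothetical.

```python
import os
import tempfile


def try_become_primary(lock_path: str, vm_name: str) -> bool:
    """Attempt to win the tie-breaker by creating the lock file atomically.

    O_CREAT | O_EXCL makes the create fail if the file already exists, so
    exactly one caller can succeed no matter how many try.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else already holds the lock
    with os.fdopen(fd, "w") as f:
        f.write(vm_name)  # record the winner for later inspection
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        lock = os.path.join(shared, "tiebreaker.lock")
        print(try_become_primary(lock, "vm-on-host-a"))  # first caller wins
        print(try_become_primary(lock, "vm-on-host-b"))  # second caller loses
```

Because the winner is decided by the storage layer rather than by the network, the two sides cannot both conclude that they are active even when they cannot talk to each other.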
✅ B. Host Isolation Response
Configured via HA:
- Leave powered on
- Power off
- Shut down
Correct configuration prevents dual active state.
✅ C. FT Role Re-election
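If communication is lost:
- Each VM notices that the other has stopped responding
- Both race for the lock on shared storage, which decides the authoritative primary
- The VM that loses the race is stopped automatically
- vSphere HA then starts a new secondary on another host when one is available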
✅ D. FT Heartbeat Mechanism
Continuous heartbeat via:
- Management network
- FT logging network
- Datastore heartbeat
Checking several independent channels reduces the chance of a false split-brain detection.
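The value of multiple channels is that a peer is treated as failed only when every channel is silent. The sketch below is a simplified illustration of that rule, not vSphere's actual detection logic; the channel keys and state names are hypothetical.

```python
from typing import Dict

CHANNELS = ("management", "ft_logging", "datastore")


def classify_peer(heartbeats: Dict[str, bool]) -> str:
    """Classify the peer from per-channel heartbeat results.

    A channel missing from `heartbeats` is treated as silent.
    """
    seen = [heartbeats.get(channel, False) for channel in CHANNELS]
    if all(seen):
        return "healthy"
    if any(seen):
        # The peer is still alive on some path: raise an alert, do not fail over.
        return "degraded"
    # No channel answers: only now is the peer presumed failed.
    return "unreachable"
```

With only the FT logging network down, the result is `"degraded"` rather than `"unreachable"`, which matches the behavior of not promoting the secondary on a single-channel failure.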
🔹 4️⃣ Example Split Brain Scenario
Situation:
- Host A and Host B running FT pair
- FT logging network fails
- Management network still up
- Storage accessible to both
What happens:
- Secondary stops receiving execution stream
- It does NOT immediately become primary
- HA evaluates host isolation
- One VM is powered off to avoid conflict
VMware always prioritizes data integrity over availability.
🔹 5️⃣ Dangerous Scenario (Improper Network Design)
If:
- FT logging network isolated
- Management network isolated
- Datastore heartbeat unavailable
Then:
Each host may assume the other failed.
Modern versions include additional safeguards, but poor design increases risk.
🔹 6️⃣ Enterprise Best Practices to Avoid Split Brain
✔ Dedicated FT logging network (10Gb recommended)
✔ Redundant physical NICs
✔ Separate VLAN for FT traffic
✔ Enable datastore heartbeat
✔ Configure proper isolation response
✔ Avoid single point of network failure
✔ Monitor FT logging latency
🔹 7️⃣ Interview-Ready Explanation
“In vSphere FT, split brain can occur if communication between primary and secondary VMs is lost, potentially causing both to assume active state. VMware prevents this using datastore locking, heartbeats, and HA isolation responses to ensure only one VM remains active.”
🔹 8️⃣ Important Clarification
FT split brain is very rare in properly designed clusters.
Modern vSphere versions heavily mitigate it using:
- Multiple heartbeat channels
- Automatic conflict resolution
- Host election logic
🔹 Final Takeaway
Split Brain Risk = Network Misconfiguration
Protection = Heartbeats + Locking + HA Policies
Design correctly → Risk nearly eliminated.