vSphere Fault Tolerance (FT)

🔷 vSphere Fault Tolerance (FT) – Detailed Explanation

In VMware vSphere, Fault Tolerance (FT) provides continuous availability for virtual machines by eliminating downtime during host failures.

Unlike vSphere HA, which restarts VMs on a surviving host after a failure, FT keeps the VM running with zero downtime and zero data loss when a host fails.

🔹 1️⃣ What is vSphere FT?

vSphere FT creates:

Primary VM (Active)
Secondary VM (Shadow copy)

Both run simultaneously on different ESXi hosts.

If the primary host fails:
✔ Secondary VM immediately becomes active
✔ No reboot
✔ No service interruption
✔ No transaction loss

This is called Continuous Availability.

🔹 2️⃣ How FT Works (Architecture)

Components Involved:

ESXi Hosts (minimum 2)
Shared Storage (SAN / vSAN / NFS)
vMotion network
FT Logging network (very important)
vCenter Server

Process Flow:

VM powered ON
FT enabled
vSphere creates a secondary VM
CPU & memory execution state is mirrored in real time
All instructions are logged and replayed

If Host A fails → Host B already has synchronized state → Takes over instantly.

🔹 3️⃣ Key Technical Concept – Lockstep Technology

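Legacy FT (vSphere 5.5 and earlier) uses vLockstep technology:

Inputs and events on the primary are recorded and sent over the FT logging network
The secondary executes the same instruction sequence, fed by those recorded inputs
Execution state therefore stays identical on both hosts
Network and disk I/O preserved

vSphere 6.0 and later replace vLockstep with Fast Checkpointing, which continuously copies changed memory and device state to the secondary instead of replaying instructions.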

Result:
Zero transaction loss.
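
The record-and-replay idea behind lockstep can be illustrated with a toy model. This is a conceptual sketch only, not VMware's implementation: the class `ToyVM`, its `apply` method, and the `replay` helper are hypothetical names, and the "VM state" is reduced to a single integer. The primary logs every input it receives, and the secondary replays the same log to arrive at an identical state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a VM whose whole state is one integer."""
    state: int = 0
    log: list = field(default_factory=list)

    def apply(self, event: int, record: bool = True) -> None:
        # Deterministic transition: the same events in the same order
        # always produce the same state.
        self.state = (self.state * 31 + event) % 1_000_003
        if record:
            self.log.append(event)


def replay(log: list) -> ToyVM:
    """Build a secondary by replaying the primary's recorded inputs."""
    secondary = ToyVM()
    for event in log:
        secondary.apply(event, record=False)
    return secondary


primary = ToyVM()
for event in [7, 42, 3, 19]:  # stand-ins for packets, timer ticks, etc.
    primary.apply(event)

secondary = replay(primary.log)
print(primary.state == secondary.state)  # True: states match after replay
```

Because the transition function is deterministic, shipping only the inputs is enough to keep the two copies identical, which is why the logging network carries the load in this design.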

🔹 4️⃣ HA vs FT Comparison

Feature	HA	FT
Downtime	2–5 minutes	0 seconds
VM Reboot	Yes	No
Data Loss	Possible (few transactions)	None
Resource Usage	Normal	Double (two VMs)
Use Case	Most workloads	Mission critical

🔹 5️⃣ FT Requirements

Infrastructure Requirements:

✔ Shared storage
✔ Compatible CPUs (EVC recommended)
✔ vMotion enabled
✔ Dedicated FT logging network
✔ Minimum 1Gbps (10Gbps recommended)
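
As a rough illustration, the checklist above can be turned into a pre-flight check. This is a minimal sketch under stated assumptions: `HostConfig` and `ft_readiness_issues` are hypothetical names, the values would have to be gathered from your own inventory tooling, and nothing here calls a real vSphere API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostConfig:
    """Hypothetical summary of one ESXi host's FT-relevant settings."""
    name: str
    shared_storage: bool
    vmotion_enabled: bool
    ft_logging_nic_gbps: float  # 0 means no dedicated FT logging NIC


def ft_readiness_issues(host: HostConfig, min_gbps: float = 1.0) -> List[str]:
    """Return the checklist items this host fails (empty list = ready)."""
    issues = []
    if not host.shared_storage:
        issues.append("no shared storage")
    if not host.vmotion_enabled:
        issues.append("vMotion not enabled")
    if host.ft_logging_nic_gbps < min_gbps:
        issues.append(f"FT logging NIC below {min_gbps:g} Gbps")
    return issues


good = HostConfig("esx-01", True, True, 10.0)
bad = HostConfig("esx-02", True, False, 0.0)
print(ft_readiness_issues(good))  # []
print(ft_readiness_issues(bad))   # ['vMotion not enabled', 'FT logging NIC below 1 Gbps']
```

Passing `min_gbps=10.0` tightens the check to the recommended bandwidth rather than the minimum.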

🔹 6️⃣ FT Limitations (Enterprise Awareness)

Not suitable for heavy DB workloads (high CPU)
Limited vCPU support (1 vCPU in legacy FT; up to 8 vCPUs per VM in vSphere 6.7 and later, depending on license edition)
Requires double compute resources
Snapshot restrictions
Limitations with some backup tools

🔹 7️⃣ Enterprise Use Cases

Ideal for:

Domain Controllers
DNS servers
License servers
Small but critical applications
Healthcare PACS controllers
Financial transaction gateways

Not ideal for:

Large Oracle DB
SAP HANA
Heavy analytics

🔹 8️⃣ FT States Explained

Protected
Not Protected
Secondary VM Running
Need Secondary
Disabled

Monitoring FT health is critical in production.

🔹 9️⃣ Failure Scenarios

Scenario 1: Host Failure

Secondary instantly becomes primary
New secondary created on another host

Scenario 2: Network Failure (FT Logging)

Protection lost but VM continues running
Alert generated

Scenario 3: Storage Failure

FT cannot protect against shared storage failure
Need Storage HA / vSAN redundancy
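
The three scenarios above can be summarised as a lookup table. This is a simplified sketch: the names `Failure`, `Outcome`, and `OUTCOMES` are hypothetical, and marking the VM as unprotected immediately after a host failure (until the new secondary exists) is an assumption of this model rather than a statement from the scenarios.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    HOST = "host failure"
    FT_LOGGING_NETWORK = "FT logging network failure"
    SHARED_STORAGE = "shared storage failure"


@dataclass(frozen=True)
class Outcome:
    service_continues: bool  # does the workload keep running?
    ft_protected: bool       # is FT protection in place right after the event?
    follow_up: str


OUTCOMES = {
    Failure.HOST: Outcome(
        True, False, "secondary becomes primary; new secondary created on another host"),
    Failure.FT_LOGGING_NETWORK: Outcome(
        True, False, "alert generated; restore the logging network"),
    Failure.SHARED_STORAGE: Outcome(
        False, False, "outside FT's scope; rely on storage HA / vSAN redundancy"),
}

for failure, outcome in OUTCOMES.items():
    print(f"{failure.value}: continues={outcome.service_continues}, "
          f"follow-up: {outcome.follow_up}")
```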

🔹 🔟 Performance Impact

FT consumes:

Extra CPU cycles
Memory overhead
Logging bandwidth

Best practice:
Enable FT only for Tier-0 workloads.

🔹 1️⃣1️⃣ Advanced Enterprise Design Example

In a 12-host cluster:

2 hosts reserved for FT heavy workloads
Separate 10Gb FT logging network
Admission control adjusted
Monitor FT latency (<10ms recommended)
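
When adjusting admission control for a design like this, it helps to count every FT-protected VM twice. The helper below is a back-of-the-envelope sketch: `ft_footprint` is a hypothetical name, the VM sizes are made-up sample values, and per-VM memory overhead is ignored.

```python
def ft_footprint(vms):
    """Return (vCPUs, memory_gb) consumed cluster-wide by FT VMs.

    Each VM is a (vcpus, memory_gb) tuple and is counted twice,
    once for the primary and once for the secondary.
    """
    vcpus = 2 * sum(v for v, _ in vms)
    memory_gb = 2 * sum(m for _, m in vms)
    return vcpus, memory_gb


# Sample FT-protected VMs: (vCPUs, memory in GB). Illustrative values only.
ft_vms = [(2, 8), (4, 16), (2, 4)]
print(ft_footprint(ft_vms))  # (16, 56)
```

Here 8 vCPUs and 28 GB of configured VM resources turn into 16 vCPUs and 56 GB that the cluster must be able to host.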

🔹 1️⃣2️⃣ Interview-Ready Explanation

“vSphere FT provides continuous availability by creating a live shadow VM that mirrors CPU and memory execution in real time. In case of host failure, the secondary VM immediately takes over without reboot, ensuring zero downtime and zero data loss.”

🔹 Final Summary

FT = Continuous Availability
HA = Restart-Based Recovery

FT is powerful but resource-intensive.
Use it only for business-critical, low-to-medium workload VMs.


Split-brain scenario of FT?

🔷 Split-Brain Scenario in vSphere FT

In VMware vSphere Fault Tolerance (FT), a split-brain situation occurs when both the Primary VM and Secondary VM believe they are the active instance due to a communication breakdown.

This is a rare but critical design scenario.

🔹 1️⃣ What is Split Brain?

In FT:

Primary VM runs on Host A
Secondary VM runs on Host B
Both are synchronized through the FT Logging network

If communication between hosts is lost (but both hosts are still running), each host may think the other has failed.

Result:
⚠ Risk of two active VMs running simultaneously.

This can cause:

Data corruption
Duplicate transactions
IP conflicts
Database inconsistencies

🔹 2️⃣ Why Split Brain Happens

Most common causes:

FT logging network failure
vMotion network isolation
Host isolation from management network
Network partition in cluster
Misconfigured gateway / VLAN

Important:
Split brain is network-driven, not storage-driven.

🔹 3️⃣ How vSphere Prevents Split Brain

vSphere FT uses atomic file locking on shared storage to coordinate failover, so that only one side continues running as the Primary VM.

Protection mechanisms include:

✅ A. Datastore Locking

The two VMs cannot both be active against the same storage at once.
A file lock on the shared datastore acts as the tie-breaker, so only one VM remains active.
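
The tie-breaker idea can be demonstrated with an ordinary file system. This is an analogy only, not how ESXi implements its locks: `try_claim_primary` and the lock file name are hypothetical, and a temporary directory stands in for the shared datastore. The point is that creating a file with `O_CREAT | O_EXCL` is atomic, so exactly one contender can win.

```python
import os
import tempfile


def try_claim_primary(lock_path: str, host: str) -> bool:
    """Atomically create the lock file; only the first caller succeeds."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else already holds the primary role
    with os.fdopen(fd, "w") as f:
        f.write(host)  # record the winner for later inspection
    return True


with tempfile.TemporaryDirectory() as shared_datastore:
    lock = os.path.join(shared_datastore, "ft-tiebreaker.lock")  # hypothetical name
    results = {h: try_claim_primary(lock, h) for h in ("host-a", "host-b")}

print(results)  # {'host-a': True, 'host-b': False}
```

Even if both hosts believe the other has failed, the storage layer arbitrates: the loser sees that the lock already exists and must not go live.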

✅ B. Host Isolation Response

Configured via HA:

Leave powered on
Power off
Shut down

Correct configuration prevents dual active state.

✅ C. FT Role Re-election

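If communication is lost:

The hosts detect the loss of the FT connection (failover does not depend on vCenter Server)
The shared-storage lock determines the authoritative primary
The other VM is stopped automatically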

✅ D. FT Heartbeat Mechanism

Continuous heartbeat via:

Management network
FT logging network
Datastore heartbeat

Multiple independent checks reduce the chance of a false split-brain detection.

🔹 4️⃣ Example Split Brain Scenario

Situation:

Host A and Host B running FT pair
FT logging network fails
Management network still up
Storage accessible to both

What happens:

Secondary stops receiving execution stream
It does NOT immediately become primary
HA evaluates host isolation
One VM is powered off to avoid conflict

VMware always prioritizes data integrity over availability.

🔹 5️⃣ Dangerous Scenario (Improper Network Design)

If:

FT logging network isolated
Management network isolated
Datastore heartbeat unavailable

Then:
Each host may assume the other failed.

Modern versions include additional safeguards, but poor design increases risk.
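
The safe and dangerous scenarios above differ only in how many communication channels survive. The function below is a simplified decision sketch of that reasoning, not actual ESXi logic: `secondary_decision` and its return strings are hypothetical.

```python
def secondary_decision(ft_logging_up: bool,
                       management_up: bool,
                       datastore_heartbeat_up: bool) -> str:
    """Model what the secondary side can safely conclude."""
    if ft_logging_up:
        # Execution stream still arriving: nothing to decide.
        return "remain secondary"
    if management_up or datastore_heartbeat_up:
        # Another channel can confirm whether the peer is alive,
        # so arbitration can stop one VM cleanly.
        return "defer to arbitration"
    # No channel left: the peer's state cannot be verified.
    return "peer state unknown"


print(secondary_decision(True, True, True))     # remain secondary
print(secondary_decision(False, True, True))    # defer to arbitration
print(secondary_decision(False, False, False))  # peer state unknown
```

Only the last case carries split-brain risk, which is why redundant, independent channels matter more than the speed of any single one.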

🔹 6️⃣ Enterprise Best Practices to Avoid Split Brain

✔ Dedicated FT logging network (10Gb recommended)
✔ Redundant physical NICs
✔ Separate VLAN for FT traffic
✔ Enable datastore heartbeat
✔ Configure proper isolation response
✔ Avoid single point of network failure
✔ Monitor FT logging latency

🔹 7️⃣ Interview-Ready Explanation

“In vSphere FT, split brain can occur if communication between primary and secondary VMs is lost, potentially causing both to assume active state. VMware prevents this using datastore locking, heartbeats, and HA isolation responses to ensure only one VM remains active.”

🔹 8️⃣ Important Clarification

FT split brain is very rare in properly designed clusters.

Modern vSphere versions heavily mitigate it using:

Multiple heartbeat channels
Automatic conflict resolution
Host election logic

🔹 Final Takeaway

Split Brain Risk = Network Misconfiguration
Protection = Heartbeats + Locking + HA Policies

Design correctly → Risk nearly eliminated.

Step-by-Step Explanation of Ballooning, Compression & Swapping in VMware

🔹 Step-by-Step Explanation of Ballooning, Compression & Swapping in VMware

1️⃣ Memory Ballooning (vmmemctl)

Ballooning is the first memory reclamation technique used when ESXi detects memory pressure.

➤ Step-by-Step: How Ballooning Works

1. VMware Tools installs the balloon driver (vmmemctl) inside the guest OS.
2. ESXi detects low free memory on the host.
3. ESXi inflates the balloon in selected VMs.
4. Balloon driver occupies guest memory, making the OS think RAM is full.
5. Guest OS frees idle / unused pages (because it believes memory is needed).
6. ESXi reclaims those freed pages and makes them available to other VMs.

Why Ballooning Happens?

• Host free memory is very low.
• ESXi wants the VM to release unused pages before resorting to swapping.

Example

• Host memory: 64 GB
• VMs used: 62 GB
• Free: 2 GB → ESXi triggers ballooning
• VM1 (8 GB RAM): Balloon inflates to 2 GB → OS frees 2 GB → ESXi re...

Tech-Gen