Advanced VCF 5.x Troubleshooting Tips & Issues (Deep‑Dive)
Below are expert-level troubleshooting scenarios for VMware Cloud Foundation 5.x that go beyond the basics, with root causes, log references, workflow behavior, and remediation paths.
⚠️ 1. Advanced Bring‑Up (Cloud Builder) Failures
A. Cloud Builder Fails Due to Corrupted OVA/ISO or Host Image
Even if deployment "seems" successful, a slightly corrupted ESXi ISO or Cloud Builder appliance can cause silent failures in later bring-up phases.
Verify the CRC, MD5, or SHA-256 hash of each image against the values Broadcom publishes. Bring-up troubleshooting guidance calls out integrity checks for both the Cloud Builder appliance and the ESXi ISO.
Fix
Validate the SHA-256 checksum of the Cloud Builder OVA and the ESXi ISO against the published values before deployment.
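A minimal sketch of that check, assuming a Linux workstation with `sha256sum` available. The function name `verify_sha256` and the file name in the usage comment are illustrative, and the expected hash must come from the Broadcom download page.

```shell
# verify_sha256 FILE EXPECTED_HASH
# Returns 0 when the SHA-256 digest of FILE matches EXPECTED_HASH.
verify_sha256() {
    file=$1
    # Normalize the published hash to lowercase before comparing.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Usage (placeholder file name and hash, not real values):
#   verify_sha256 cloud-builder.ova "<sha256 published by Broadcom>"
```

A mismatch means the image should be downloaded again rather than deployed.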
B. Hostname/FQDN Mis‑Entries in Deployment Workbook
VCF bring-up fails if the workbook contains fully qualified names for ESXi hosts instead of short hostnames, even if validation appears to pass.
This exact failure scenario is documented in advanced bring‑up troubleshooting guides.
Symptom: The "Download SSH Keys using Guest Program for vCenter" task fails.
C. MTU Path Inconsistency Check Failure
VCF bring‑up performs hop‑by‑hop MTU validation.
- If a vSAN/vMotion path has mixed MTU (e.g., 1500 → 9000 → 1500), deployment validation fails.
- Documented as a common bring‑up blocker.
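Fix: Test each vSAN/vMotion path end to end with a don't-fragment ping sized for the configured MTU (for example, vmkping on ESXi), then correct any hop that drops the packet.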
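One way to run such a test is sketched below. The helper `icmp_payload_for_mtu` is an illustrative name, and the VMkernel interface and target IP in the comment are placeholders. The arithmetic assumes IPv4 (20-byte IP header plus 8-byte ICMP header).

```shell
# icmp_payload_for_mtu MTU
# Prints the largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# On an ESXi host (interface and target IP are placeholders):
#   vmkping -I vmk1 -d -s "$(icmp_payload_for_mtu 9000)" 192.0.2.10
# -d sets the don't-fragment bit, so the ping fails if any hop on the
# path has an MTU smaller than the one being tested.
```

If the 9000-byte test fails but a test sized for 1500 succeeds, some device on the path is still at the smaller MTU.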
2. Advanced SDDC Manager / LCM Workflow Failures
A. LCM Service Crash During Upgrade Workflows
Broadcom KBs identify LCM crashes triggered by:
- Bad inventory states
- Missing NSX/vCenter prechecks
- Version mismatch within BOM (Bill of Materials)
These issues are specifically documented in critical VCF 5.1.x KBs.
Fix
Check /var/log/vmware/vcf/lcm/lcm.log for:
LCM Failed: bundleMismatchException
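A small sketch for searching the log, run on the SDDC Manager appliance. The function name `find_lcm_errors` is illustrative. The default pattern is the string quoted above, and the exact wording in a given build may differ, so pass another pattern if needed.

```shell
# find_lcm_errors LOGFILE [PATTERN]
# Prints the 20 most recent matching lines, prefixed with line numbers.
find_lcm_errors() {
    log=$1
    pattern=${2:-bundleMismatchException}
    # -n adds line numbers, -i ignores case, -- protects patterns starting with "-".
    grep -n -i -- "$pattern" "$log" | tail -n 20
}

# Usage:
#   find_lcm_errors /var/log/vmware/vcf/lcm/lcm.log
#   find_lcm_errors /var/log/vmware/vcf/lcm/lcm.log "ERROR"
```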
B. Async Patch Tool Required for Out‑of‑Band Environments
VCF cannot upgrade properly if administrators manually patched vCenter/ESXi outside the LCM workflow.
Broadcom provides the Async Patch Tool to correct this state. [aroracloud.com]
C. Stuck Workflow States in SDDC Manager
Symptoms:
- Tasks stuck at "In Progress"
- SDDC Manager UI hangs
Advanced troubleshooting guides recommend using the SoS execution history to identify root workflow states:
./sos --history
🔐 3. Advanced Access, Authentication & Account Issues
A. Local Account Lockouts Affecting API & LCM
VCF 5.x SDDC Manager tightly integrates SSO + local system accounts.
If local accounts lock (very common post-upgrade), multiple dependent workflows fail.
Official remediation:
Restart the commonsvcs service.
systemctl restart commonsvcs
Allow about 5 minutes for services to be fully restored; the restart is not instant.
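Because the restart is not instant, a script can poll rather than sleep for a fixed time. The sketch below uses a generic retry helper. `wait_for` is an illustrative name, and treating `systemctl is-active` as the readiness signal is an assumption: the unit can report active before dependent workflows are usable.

```shell
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every 5 seconds until it succeeds or the timeout elapses.
wait_for() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 5
        elapsed=$(( elapsed + 5 ))
    done
}

# Usage on the SDDC Manager appliance:
#   systemctl restart commonsvcs
#   wait_for 300 systemctl is-active --quiet commonsvcs || echo "commonsvcs not active after 5 minutes"
```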
B. SDDC Manager Token Corruption After Stale Sessions
Symptoms:
- “401 Unauthorized”
- LCM bundle import fails
- SDDC Manager UI login redirect loops
Fix: Clear token cache + restart SDDC services.
🌐 4. NSX‑T Advanced Issues in VCF 5.x
A. NSX Federation Compatibility Pitfalls
VCF 5.x upgrade path requires strict NSX Federation version alignment.
Broadcom references highlight misaligned NSX Federation builds as a major cause of upgrade and deployment failures.
B. Edge Deployment Fails During Workload Domain Creation
Root causes:
- Missing Tier‑0 uplinks
- Incorrect Edge Cluster profile
- MTU inconsistency
- Missing DNS forward/reverse entries for Edge nodes
C. Certificate Failures Breaking NSX/SDDC Manager Communication
Symptoms:
- Edges enter “Unknown” state
- LCM stops managing NSX upgrades
- API connections break
Fix: Rotate NSX certificates and resync with SDDC Manager.
🧱 5. Workload Domain Creation – Expert-Level Failures
A. vCenter Deployment Fails Mid‑Workflow
Because VCF automatically deploys a new vCenter for each WLD, failures occur if:
- DHCP leases conflict with temporary bring-up IPs
- DNS forward/reverse records are missing
- Appliance size is mismatched to the hosts
VCF troubleshooting documentation highlights this.
B. ESXi Host Commissioning Fails Even If Ping Succeeds
Causes:
- Short name used where an FQDN is expected, or the reverse
- Thumbprint mismatch
- Host already part of another vCenter
- NTP offset > 5 seconds
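The clock-offset cause can be checked before commissioning. This is a rough sketch: `clock_offset_ok` is an illustrative helper, the 5-second default comes from the list above, and the host name in the usage comment is a placeholder. Comparing `date +%s` over SSH only approximates the NTP offset because it includes SSH latency.

```shell
# clock_offset_ok LOCAL_EPOCH REMOTE_EPOCH [MAX_SECONDS]
# Returns 0 when the two clocks differ by at most MAX_SECONDS (default 5).
clock_offset_ok() {
    max=${3:-5}
    diff=$(( $1 - $2 ))
    # Take the absolute value of the difference.
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "$max" ]
}

# Usage (placeholder host name):
#   clock_offset_ok "$(date +%s)" "$(ssh root@esxi01.example.com date +%s)" \
#       || echo "clock offset exceeds 5 seconds"
```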
📁 6. Advanced Log Analysis (Critical Logs)
Run SoS with component‑only collection
To reduce timeout risk for large environments:
./sos --sddc-manager-logs
./sos --vc-logs
./sos --nsx-logs
./sos --domain-name <domain>
Cloud Builder Deep Logs
/var/log/vmware/vcf/bringup/
Highly useful for:
- Validation failures
- Scripted phase breakdowns
Aria Lifecycle (Automation) Logs
/var/log/vrlcm/vmware_vrlcm.log
⚙️ 7. Infrastructure-Level Advanced Tips
A. Validate BOM Compatibility Before ANY Change
Release notes emphasize BOM review before:
- Upgrades
- Patch import
- Adding hosts
BOM incompatibility is the #1 cause of LCM failures.
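A BOM review can be partly scripted. The sketch below assumes a hypothetical two-column text format (one "component version" pair per line) that you build by hand from the release notes and from your inventory. Neither VCF nor SoS produces these files, and `compare_bom` is an illustrative name.

```shell
# compare_bom EXPECTED_FILE ACTUAL_FILE
# Prints every component in ACTUAL_FILE whose version differs from, or is
# absent in, EXPECTED_FILE. Returns 1 if any difference is found.
compare_bom() {
    awk '
        NR == FNR { expected[$1] = $2; next }   # first file: expected BOM
        !($1 in expected) { printf "%s: not in BOM\n", $1; bad = 1; next }
        expected[$1] != $2 {
            printf "%s: expected %s, found %s\n", $1, expected[$1], $2
            bad = 1
        }
        END { exit bad }
    ' "$1" "$2"
}

# Usage (placeholder file names):
#   compare_bom bom-from-release-notes.txt installed-versions.txt
```

A plain string comparison is used on purpose: the BOM lists exact builds, so any difference is worth reviewing before an upgrade, patch import, or host addition.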
1. Common Issues in VCF 5.x
1. Bring‑Up / Deployment Failures
Typical causes:
- DNS/NTP misconfiguration
- Wrong hostnames (e.g., FQDN used instead of short name)
- Networking MTU/VLAN/trunking issues
- Bad ISO or Cloud Builder appliance corruption (CRC mismatch)
Source: Networking, hash validation, and hostname issues are frequently cited in VCF deployment troubleshooting guides.
Tip: Always validate SHA256/MD5 of Cloud Builder + ESXi ISO to avoid silent corruption.
2. SDDC Manager: Lifecycle Management (LCM) Crashes
Reported in VCF 5.1.x and 5.2.x during:
- Inventory sync
- Upgrade bundle extraction
- NSX upgrades
Broadcom notes specific KBs addressing LCM service crash scenarios.
3. vCenter / ESXi Version Mismatch
Mixed versions (e.g., vCenter 8.x with ESXi 7.x during upgrade waves) cause:
- ELM issues
- LCM workflow failures
- Pre-check errors
Broadcom KB highlights limitations of mixed vCenter versions during VCF 5.x upgrades.
4. NSX‑T Issues in VCF Domains
Common problems include:
- Edge deployment failure
- Federation incompatibility
- T0/T1 route advertisement issues
- Certificate trust failures
Upgrading with NSX Federation requires following specific guidance.
5. Account Lockouts (VCF Local Accounts)
If SDDC Manager UI login fails:
- Local account may be locked
- Fix by restarting the commonsvcs service
Procedure is documented in troubleshooting guides:
systemctl restart commonsvcs
6. Workload Domain Creation Failures
Typical root causes:
- vCenter deployment failure
- ESXi commissioning issues
- NSX integration errors
- Wrong network profiles
Training and troubleshooting references confirm these patterns.
🔧 2. Troubleshooting Methodology (VCF 5.x)
1. Always Start with SoS Utility
The Supportability and Serviceability (SoS) tool is the central diagnostic tool.
Run from SDDC Manager appliance:
cd /opt/vmware/sddc-support
./sos --help
Available options include:
- --vc-logs
- --esx-logs
- --sddc-manager-logs
- --nsx-logs
- --collect-all-logs
SoS log collection timeouts may occur in large deployments.
2. Validate BOM Versions
Always confirm components match VCF Bill of Materials to avoid:
- Bring‑up failures
- LCM crashes
- NSX/vCenter mismatches
Release notes strongly emphasize this.
3. Re-check Networking Prerequisites
Common root causes:
- MTU mismatch
- Incorrect VLANs
- Trunk misconfiguration
- DNS forward/reverse resolution fails
Confirmed in multiple troubleshooting sources.
4. Restart Key Services for Access/Workflow Issues
Restart commonsvcs (account lockout fix)
systemctl restart commonsvcs
Restart SDDC Manager services
systemctl restart sddc-manager
Restart LCM
systemctl restart lcm
5. Validate Deployment Parameters (Critical for Bring‑Up)
Wrong hostnames in Deployment Workbook cause failures.
📁 3. Log Locations for VCF 5.x
A. SDDC Manager Logs
| Component | Path |
|---|---|
| SDDC Manager Logs | /var/log/vmware/vcf/sddc-manager/ |
| Domain Manager | /var/log/vmware/vcf/domainmanager/domainmanager.log |
| LCM Logs | /var/log/vmware/vcf/lcm/lcm.log |
| Workflow Logs | /var/log/vmware/vcf/automation-framework/ |
B. vCenter / ESXi Logs (Through SoS)
SoS Options:
- --vc-logs (vCenter logs) [techdocs.b...oadcom.com]
- --esx-logs (ESXi logs) [techdocs.b...oadcom.com]
Examples:
- vCenter: /var/log/vmware/vpxd/vpxd.log
- ESXi: /var/log/hostd.log, /var/log/vmkernel.log
C. NSX Logs
Collected via:
./sos --nsx-logs
NSX Manager location (typical):
/var/log/syslog
/var/log/vmware/nsx-manager/
D. Cloud Builder Appliance Logs
SoS collects:
- Cloud Builder logs
- API logs
- Validation logs
Critical path:
/var/log/vmware/vcf/bringup/
E. Aria Lifecycle / VCF Automation Logs
/var/log/vrlcm/vmware_vrlcm.log
🛠 4. Quick Commands for VCF Troubleshooting
Run SoS with all logs
./sos --collect-all-logs
SoS history
./sos --history
SoS debug mode
./sos --debug-mode
🎯 5. Summary of Key Issues in VCF 5.x
| Area | Symptoms | Cause |
|---|---|---|
| Bring‑up | Validation failures | DNS/NTP, wrong hostnames, corrupt ISO |
| LCM | Crashes / stuck tasks | Build mismatch, NSX incompatibility |
| NSX | Edges fail, routing breaks | Federation issues, certificates |
| SDDC Manager | Login fails | Locked account / commonsvcs failure |
| Workload Domains | Deployment fails | vCenter/NSX config mismatch |