
Advanced VCF 5.x Troubleshooting Tips & Issues

 


Below are expert‑level, field‑proven troubleshooting scenarios for VMware Cloud Foundation 5.x — beyond the basics — with real-world root causes, log references, workflow behavior, and remediation paths.


⚠️ 1. Advanced Bring‑Up (Cloud Builder) Failures

A. Cloud Builder Fails Due to Corrupted OVA/ISO or Host Image

Even if deployment “seems” successful, a slightly corrupted ESXi ISO or Cloud Builder appliance causes silent failures in later bring‑up phases.

  • VMware recommends verifying CRC/MD5/SHA‑256 hash values against published Broadcom values.
    This is emphasized in bring‑up troubleshooting, including Cloud Builder and ESXi ISO integrity checks.

Fix
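Validate checksums before deployment and compare the output with the published values (the commands below are the macOS forms; on Linux use sha256sum and md5sum):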


shasum -a 256 VMware-Cloud-Builder.ova
md5 VMware-ESXi.iso
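
The comparison against the published value can also be scripted so a mismatch fails loudly instead of relying on a visual check. The following is a minimal sketch: the function name verify_sha256 and the placeholder digest are illustrative assumptions, not part of any VMware or Broadcom tooling.

```shell
#!/usr/bin/env bash
# Sketch: compare a file's SHA-256 digest with a value copied from the download portal.
# verify_sha256 is a hypothetical helper name used only for this example.

verify_sha256() {
  local file="$1" expected="$2" actual
  # Prefer sha256sum (Linux); fall back to shasum (macOS)
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$file" | awk '{print $1}')
  else
    actual=$(shasum -a 256 "$file" | awk '{print $1}')
  fi
  # Lower-case the expected digest so upper- and lower-case hex compare equal
  expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (computed: $actual)" >&2
    return 1
  fi
}

# Usage (replace the placeholder with the digest published for your build):
# verify_sha256 VMware-Cloud-Builder.ova "<published-sha256>"
```

A non-zero return code makes the check easy to chain in front of a deployment step.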

B. Hostname/FQDN Mis‑Entries in Deployment Workbook

VCF bring‑up fails if the workbook contains fully qualified names for ESXi hosts instead of short hostnames — even if validation appears to pass.
This exact failure scenario is documented in advanced bring‑up troubleshooting guides.

Symptom: Download SSH Keys using Guest Program for vCenter fails.
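
Since validation may not catch this, the host name column can be checked before the workbook is uploaded. This is a small sketch under the assumption that the host names have been copied out of the workbook by hand; is_short_hostname and check_workbook_hostnames are hypothetical helper names, and the example host names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: flag entries that contain a dot (FQDN) where a short host name is expected.

is_short_hostname() {
  case "$1" in
    ""|*.*) return 1 ;;   # empty, or contains a domain part
    *)      return 0 ;;
  esac
}

check_workbook_hostnames() {
  local rc=0 name
  for name in "$@"; do
    if ! is_short_hostname "$name"; then
      echo "Not a short host name: '$name'" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage (placeholder names):
# check_workbook_hostnames esxi01 esxi02 esxi03.example.com
```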


C. MTU Path Inconsistency Check Failure

VCF bring‑up performs hop‑by‑hop MTU validation.

  • If a vSAN/vMotion path has mixed MTU (e.g., 1500 → 9000 → 1500), deployment validation fails.
  • Documented as a common bring‑up blocker.

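Fix: Test the path end to end with the don't-fragment bit set, from the VMkernel interface that carries the traffic (8972 bytes = 9000-byte MTU minus 28 bytes of IP and ICMP headers):

vmkping -I <vmk-interface> -s 8972 -d <target-ip>

For NSX TEP interfaces, add ++netstack=vxlan. If the vMotion VMkernel adapter sits on the vMotion TCP/IP stack, add ++netstack=vmotion.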
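
The 8972-byte payload follows from the MTU: 9000 bytes minus a 20-byte IPv4 header and an 8-byte ICMP header. A minimal sketch of that arithmetic, wrapped around vmkping for use in the ESXi shell, is shown here; the function names and the example interface are assumptions for illustration.

```shell
# Sketch: derive the largest ICMP payload that fits a given MTU and ping with DF set.
# icmp_payload_for_mtu and test_jumbo_path are hypothetical helper names.

icmp_payload_for_mtu() {
  # payload = MTU - 20 (IPv4 header) - 8 (ICMP header)
  echo $(( $1 - 28 ))
}

test_jumbo_path() {
  # $1 = VMkernel interface, $2 = target IP, $3 = MTU (defaults to 9000)
  vmk="$1"; target="$2"; mtu="${3:-9000}"
  vmkping -I "$vmk" -s "$(icmp_payload_for_mtu "$mtu")" -d "$target"
}

# Usage (interface name is an example; use the vmk that carries vSAN or vMotion):
# test_jumbo_path vmk2 <target-ip> 9000
```

If the jumbo-sized ping fails while the same call with an MTU of 1500 succeeds, a device in the path is most likely still configured for the smaller MTU.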

2. Advanced SDDC Manager / LCM Workflow Failures

A. LCM Service Crash During Upgrade Workflows

Broadcom KBs identify LCM crashes triggered by:

  • Bad inventory states
  • Missing NSX/vCenter prechecks
  • Version mismatch within BOM (Bill of Materials)

These issues are specifically documented in critical VCF 5.1.x KBs.

Fix
Check /var/log/vmware/vcf/lcm/lcm.log for:

LCM Failed: bundleMismatchException
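
Rather than searching for one exact string, it can help to pull every error and exception line with its line number. This is a rough sketch; scan_lcm_log is a hypothetical helper, and the generic error|exception pattern is an assumption, not an official list of LCM failure signatures.

```shell
#!/usr/bin/env bash
# Sketch: show the most recent error/exception lines from the LCM log.

scan_lcm_log() {
  local log="${1:-/var/log/vmware/vcf/lcm/lcm.log}" count="${2:-20}"
  # -n prefixes each match with its line number; -i ignores case
  grep -n -i -E 'error|exception' "$log" | tail -n "$count"
}

# Usage on the SDDC Manager appliance:
# scan_lcm_log                                   # last 20 matches in the default log
# scan_lcm_log /var/log/vmware/vcf/lcm/lcm.log 50
```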

B. Async Patch Tool Required for Out‑of‑Band Environments

VCF cannot upgrade properly if administrators manually patched vCenter/ESXi outside the LCM workflow.
Broadcom provides the Async Patch Tool to correct this state.
[aroracloud.com]


C. Stuck Workflow States in SDDC Manager

Symptoms:

  • Tasks stuck at "In Progress"
  • SDDC Manager UI hangs

Advanced troubleshooting guides recommend using the SoS execution history to identify root workflow states:

./sos --history


🔐 3. Advanced Access, Authentication & Account Issues

A. Local Account Lockouts Affecting API & LCM

VCF 5.x SDDC Manager tightly integrates SSO + local system accounts.
If local accounts lock (very common post-upgrade), multiple dependent workflows fail.

Official remediation:
Restart the commonsvcs service.

systemctl restart commonsvcs

This takes ~5 minutes to fully restore services (not instant).


B. SDDC Manager Token Corruption After Stale Sessions

Symptoms:

  • “401 Unauthorized”
  • LCM bundle import fails
  • SDDC Manager UI login redirect loops

Fix: Clear token cache + restart SDDC services.


🌐 4. NSX‑T Advanced Issues in VCF 5.x

A. NSX Federation Compatibility Pitfalls

VCF 5.x upgrade path requires strict NSX Federation version alignment.

Broadcom references highlight misaligned NSX Federation builds as a major cause of upgrade and deployment failures.


B. Edge Deployment Fails During Workload Domain Creation

Root causes:

  • Missing Tier‑0 uplinks
  • Incorrect Edge Cluster profile
  • MTU inconsistency
  • Missing DNS forward/reverse entries for Edge nodes
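
The DNS item in this list is the easiest to verify ahead of time. The sketch below checks that a name resolves to an address and that the address resolves back to the same name. It assumes dig is available on the machine running the check; the function names and the example FQDN are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: forward and reverse DNS round-trip check for one FQDN.

resolve_forward() { dig +short "$1" A | head -n 1; }
resolve_reverse() { dig +short -x "$1" | head -n 1 | sed 's/\.$//'; }

check_dns_roundtrip() {
  local fqdn="$1" ip ptr
  ip=$(resolve_forward "$fqdn")
  if [ -z "$ip" ]; then
    echo "FAIL: no forward (A) record for $fqdn" >&2
    return 1
  fi
  ptr=$(resolve_reverse "$ip")
  # DNS names are case-insensitive, so compare in lower case
  if [ "$(printf '%s' "$ptr" | tr 'A-Z' 'a-z')" != "$(printf '%s' "$fqdn" | tr 'A-Z' 'a-z')" ]; then
    echo "FAIL: $fqdn -> $ip -> '${ptr:-no PTR record}'" >&2
    return 1
  fi
  echo "OK: $fqdn <-> $ip"
}

# Usage (placeholder name):
# check_dns_roundtrip edge01.example.com
```

The same check applies to the vCenter and ESXi names used elsewhere in a deployment.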

C. Certificate Failures Breaking NSX/SDDC Manager Communication

Symptoms:

  • Edges enter “Unknown” state
  • LCM stops managing NSX upgrades
  • API connections break

Fix: Rotate NSX certificates and resync with SDDC Manager.


🧱 5. Workload Domain Creation – Expert-Level Failures

A. vCenter Deployment Fails Mid‑Workflow

Because VCF automatically deploys a new vCenter for each WLD, failures occur if:

  • DHCP leases conflict with the temporary IP addresses used during deployment
  • DNS forward/reverse missing
  • Appliance size mismatched to hosts

VCF troubleshooting documentation highlights this.


B. ESXi Host Commissioning Fails Even If Ping Succeeds

Causes:

  • Wrong short name vs FQDN
  • Thumbprint mismatch
  • Host already part of another vCenter
  • NTP offset > 5 seconds

📁 6. Advanced Log Analysis (Critical Logs)

Run SoS with component‑only collection

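To reduce timeout risk in large environments, collect one component at a time and use --domain-name to limit a run to a single workload domain:

./sos --sddc-manager-logs
./sos --vc-logs --domain-name <domain>
./sos --nsx-logs --domain-name <domain>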
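
Component-by-component collection can be wrapped in a loop so one failed or timed-out run does not stop the rest. This is a sketch only: collect_by_component and the SOS_BIN variable are names introduced for the example, and combining each log option with --domain-name is assumed to be accepted by your SoS version.

```shell
#!/usr/bin/env bash
# Sketch: run SoS once per component for a single workload domain.

SOS_BIN="${SOS_BIN:-/opt/vmware/sddc-support/sos}"

collect_by_component() {
  local domain="$1" opt rc=0
  for opt in --sddc-manager-logs --vc-logs --nsx-logs; do
    # Keep going if one component fails, but remember the failure
    if ! "$SOS_BIN" "$opt" --domain-name "$domain"; then
      echo "SoS run with $opt failed for domain $domain" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the SDDC Manager appliance:
# collect_by_component <domain>
```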

Cloud Builder Deep Logs

/var/log/vmware/vcf/bringup/

Highly useful for:

  • Validation failures
  • Scripted phase breakdowns

Aria Suite Lifecycle Logs

/var/log/vrlcm/vmware_vrlcm.log


⚙️ 7. Infrastructure-Level Advanced Tips

A. Validate BOM Compatibility Before ANY Change

Release notes emphasize BOM review before:

  • Upgrades
  • Patch import
  • Adding hosts

A BOM mismatch is the #1 cause of LCM failures.
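
A BOM review can be reduced to a simple diff between the versions the release notes list and the versions actually running. The sketch below assumes you have written both into a three-column text list by hand; check_bom is a hypothetical helper, and the component names and versions in the usage example are placeholders, not real BOM values.

```shell
#!/usr/bin/env bash
# Sketch: report components whose running version differs from the BOM version.
# Input on stdin, one component per line: <component> <bom-version> <running-version>

check_bom() {
  awk '
    /^#/ { next }                       # skip comment lines
    NF == 3 && $2 != $3 {
      printf "MISMATCH %s: expected %s, found %s\n", $1, $2, $3
      bad = 1
    }
    END { exit bad }                    # non-zero exit if any mismatch was seen
  '
}

# Usage (placeholder values):
# printf 'vcenter 1.0 1.0\nnsx 2.0 2.1\n' | check_bom
```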




1. Common Issues in VCF 5.x

1. Bring‑Up / Deployment Failures

Typical causes:

  • DNS/NTP misconfiguration
  • Wrong hostnames (e.g., FQDN used instead of short name)
  • Networking MTU/VLAN/trunking issues
  • Bad ISO or Cloud Builder appliance corruption (CRC mismatch)

Source: Networking, hash validation, and hostname issues are frequently cited in VCF deployment troubleshooting guides. 

Tip: Always validate SHA256/MD5 of Cloud Builder + ESXi ISO to avoid silent corruption. 


2. SDDC Manager: Lifecycle Management (LCM) Crashes

Reported in VCF 5.1.x and 5.2.x during:

  • Inventory sync
  • Upgrade bundle extraction
  • NSX upgrades

Broadcom notes specific KBs addressing LCM service crash scenarios.


3. vCenter / ESXi Version Mismatch

Mixed versions (e.g., vCenter 8.x with ESXi 7.x during upgrade waves) cause:

  • ELM issues
  • LCM workflow failures
  • Pre-check errors

Broadcom KB highlights limitations of mixed vCenter versions during VCF 5.x upgrades.


4. NSX‑T Issues in VCF Domains

Common problems include:

  • Edge deployment failure
  • Federation incompatibility
  • T0/T1 route advertisement issues
  • Certificate trust failures

Upgrading with NSX Federation requires following specific guidance.


5. Account Lockouts (VCF Local Accounts)

If SDDC Manager UI login fails:

  • Local account may be locked
  • Fix by restarting commonsvcs

Procedure is documented in troubleshooting guides:

systemctl restart commonsvcs


6. Workload Domain Creation Failures

Typical root causes:

  • vCenter deployment failure
  • ESXi commissioning issues
  • NSX integration errors
  • Wrong network profiles

Training and troubleshooting references confirm these patterns.


🔧 2. Troubleshooting Methodology (VCF 5.x)

1. Always Start with SoS Utility

The Supportability and Serviceability (SoS) tool is the central diagnostic tool.

Run from SDDC Manager appliance:

cd /opt/vmware/sddc-support
./sos --help

Available options include:

  • --vc-logs
  • --esx-logs
  • --sddc-manager-logs
  • --nsx-logs
  • --collect-all-logs

SoS log collection timeouts may occur in large deployments.

2. Validate BOM Versions

Always confirm components match VCF Bill of Materials to avoid:

  • Bring‑up failures
  • LCM crashes
  • NSX/vCenter mismatches

Release notes strongly emphasize this.

3. Re-check Networking Prerequisites

Common root causes:

  • MTU mismatch
  • Incorrect VLANs
  • Trunk misconfiguration
  • DNS forward/reverse resolution fails

Confirmed in multiple troubleshooting sources.

4. Restart Key Services for Access/Workflow Issues

Restart commonsvcs (account lockout fix)

systemctl restart commonsvcs

Restart SDDC Manager services

/opt/vmware/vcf/operationsmanager/scripts/cli/sddcmanager_restart_services.sh

Restart LCM

systemctl restart lcm

5. Validate Deployment Parameters (Critical for Bring‑Up)

Wrong hostnames in Deployment Workbook cause failures.

📁 3. Log Locations for VCF 5.x

A. SDDC Manager Logs

Component          | Path
SDDC Manager Logs  | /var/log/vmware/vcf/sddc-manager/
Domain Manager     | /var/log/vmware/vcf/domainmanager/domainmanager.log
LCM Logs           | /var/log/vmware/vcf/lcm/lcm.log
Workflow Logs      | /var/log/vmware/vcf/automation-framework/

B. vCenter / ESXi Logs (Through SoS)

SoS options: --vc-logs, --esx-logs

Examples:

  • vCenter: /var/log/vmware/vpxd/vpxd.log
  • ESXi: /var/log/hostd.log, /var/log/vmkernel.log

C. NSX Logs

Collected via:

./sos --nsx-logs

NSX Manager location (typical):

/var/log/syslog
/var/log/vmware/nsx-manager/

D. Cloud Builder Appliance Logs

SoS collects:

  • Cloud Builder logs
  • API logs
  • Validation logs

Critical path:

/var/log/vmware/vcf/bringup/

E. Aria Suite Lifecycle Logs

/var/log/vrlcm/vmware_vrlcm.log


🛠 4. Quick Commands for VCF Troubleshooting

Run SoS with all logs

./sos --collect-all-logs

SoS history

./sos --history

SoS debug mode




🎯 5. Summary of Key Issues in VCF 5.x

Area             | Symptoms                    | Cause
Bring‑up         | Validation failures         | DNS/NTP, wrong hostnames, corrupt ISO
LCM              | Crashes / stuck tasks       | Build mismatch, NSX incompatibility
NSX              | Edges fail, routing breaks  | Federation issues, certificates
SDDC Manager     | Login fails                 | Locked account / commonsvcs failure
Workload Domains | Deployment fails            | vCenter/NSX config mismatch
