PNOMA Troubleshooting Steps (vSAN Framework)

The VMware vSAN: Troubleshooting [V8] course defines PNOMA as:

P – Problem, N – Notify, O – Observe, M – Mitigate, A – Acknowledge [koenig-solutions.com]


Below is a well‑structured explanation of each phase:

1. Problem

Identify and define the issue clearly.

Determine what is actually broken: performance degradation, object failures, host issues, network faults, etc.
Collect initial symptoms from logs, alerts (vSAN Skyline Health, vSphere alarms), and user reports.
[koenig-solutions.com]

2. Notify

Communicate the issue to relevant stakeholders.

Notify team members, administrators, or impacted service/application owners.
Ensure proper logging or ticket creation so the issue is tracked formally.
[koenig-solutions.com]

3. Observe

Gather detailed diagnostic information.

Use vSAN’s GUI & CLI troubleshooting tools (Skyline Health, vSphere Client, performance charts, CLI commands).
Validate network health, disk group status, object health, and cluster performance metrics.
Observe trends, patterns, and possible correlations.
[koenig-solutions.com]

4. Mitigate

Take corrective or preventive actions.

Apply remediation steps (fix hardware issues, rebalance workloads, resolve network bottlenecks, replace failing disks, etc.).
Implement temporary workarounds if root-cause resolution requires more time.
Use VMware KB articles or community guidance if necessary.
[koenig-solutions.com]

5. Acknowledge

Confirm resolution and document findings.

Validate that the issue is resolved and the vSAN cluster has returned to a healthy state.
Communicate resolution to stakeholders.
Document the root cause and resolution steps for future reference.
[koenig-solutions.com]

✅ Summary

PNOMA is a structured troubleshooting workflow used in VMware vSAN environments:

P – Problem → N – Notify → O – Observe → M – Mitigate → A – Acknowledge

This model ensures consistent, methodical, and well‑communicated issue resolution.


You said:

Provide a PNOMA checklist

You said:

Add examples to each step

You said:

Add CLI commands per step

Copilot said:


Below is the enhanced PNOMA Troubleshooting Checklist with examples + relevant CLI commands for each phase.
The PNOMA framework definition (Problem → Notify → Observe → Mitigate → Acknowledge) is based on the VMware vSAN: Troubleshooting [V8] course. [koenig-solutions.com]

✅ PNOMA Troubleshooting Checklist + Examples + CLI Commands

🟦 P — Problem

Goal: Identify and define the issue clearly.

✔ What to Do

Review Skyline Health / vSphere alarms.
Identify affected host, VM, disk group, or object.
Check logs for error patterns.
Validate scope and business impact.

📘 Example

A VM is unresponsive; Skyline Health reports disk group backend latency on one ESXi host.

🔧 Useful CLI Commands

Shell
# Check vSAN cluster health
esxcli vsan health cluster list

# Check physical disk status
esxcli storage core device list

# Check vSAN disk group state
esxcli vsan storage list

# Check VMkernel logs for errors
cat /var/log/vmkernel.log | grep -i vsan
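The log check above can be turned into a quick count, which helps when scoping the problem. This is a minimal sketch, not a VMware tool: the function name `vsan_log_triage` and the keyword list (error, fail, warn) are assumptions chosen for illustration.

```shell
#!/bin/sh
# vsan_log_triage FILE: count vSAN-related lines in a log file, and how many
# of those also mention error, fail, or warn (case-insensitive).
# Hypothetical helper; the keyword list is an assumption, adjust as needed.
vsan_log_triage() {
    log_file="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    total=$(grep -ci 'vsan' "$log_file" || true)
    flagged=$(grep -i 'vsan' "$log_file" | grep -ciE 'error|fail|warn' || true)
    echo "vsan_lines=${total} flagged=${flagged}"
}

# On an ESXi host (log path taken from the commands above):
#   vsan_log_triage /var/log/vmkernel.log
```

A high flagged count on one host and not on its peers is a reasonable first hint about scope, though it does not replace reading the lines themselves.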

🟩 N — Notify

Goal: Notify responsible teams or stakeholders early.

✔ What to Do

Inform operations/SRE teams.
Notify application owners of degradation.
Open/update an incident ticket.
Assign an incident commander.

📘 Example

Informing the DB team that vSAN disk latency may affect database query performance.

🔧 Useful CLI Commands

(Notification is a human process, but you may pull additional data for your notification.)

Shell
# List the hardware (IPMI) system event log for inclusion in notifications
esxcli hardware ipmi sel list

# Extract the most recent VMkernel log lines for the incident ticket
tail -n 200 /var/log/vmkernel.log
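To keep the notification factual, the supporting data can be gathered into one text block and pasted into the ticket. The sketch below assumes a hypothetical helper named `incident_note`; the layout of the note is illustrative, not a required format.

```shell
#!/bin/sh
# incident_note HOST LOG [LINES]: print a plain-text block for an incident
# ticket with the host name, a UTC timestamp, and the last LINES
# vSAN-related lines of LOG (default 20). Hypothetical helper.
incident_note() {
    host="$1"
    log_file="$2"
    lines="${3:-20}"
    echo "Host: ${host}"
    echo "Captured (UTC): $(date -u '+%Y-%m-%d %H:%M:%S')"
    echo "Last ${lines} vSAN-related log lines from ${log_file}:"
    grep -i 'vsan' "$log_file" | tail -n "$lines"
}

# Usage on an ESXi host:
#   incident_note "$(hostname)" /var/log/vmkernel.log 50 > /tmp/incident.txt
```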

🟧 O — Observe

Goal: Collect detailed diagnostics to isolate the root cause.

✔ What to Do

Use Skyline Health, vSphere Client, performance charts.
Check host and VM metrics.
Inspect vSAN network health (MTU, latency, drops).
Validate disk groups and object states.
Identify patterns (e.g., nightly jobs, host-specific issues).

📘 Example

CLI shows RX packet drops on a vSAN NIC, pointing to a network bottleneck.

🔧 Useful CLI Commands

Shell
# List the VMkernel interfaces used for vSAN traffic
esxcli vsan network list

# Verify MTU settings on physical NICs and VMkernel interfaces
esxcli network nic list
esxcli network ip interface list

# Monitor NIC statistics (drops/errors)
esxcli network nic stats get -n vmnicX

# Show real-time vSAN performance data
esxtop # Press 'x' for the vSAN view
vsantop # vSAN 6.7 U3 and later

# Check per-disk health and congestion
esxcli vsan debug disk list

# Inspect object health (RVC, run on vCenter Server)
rvc administrator@vsphere.local@localhost
vsan.check_state <cluster>
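The NIC statistics are easier to compare across hosts once the drop counters are reduced to one line. The sketch below parses the output of `esxcli network nic stats get -n vmnicX` from standard input. The label text it matches ("Receive packets dropped", "Transmit packets dropped") is an assumption about the output format and should be checked against your ESXi build; `nic_drops` is a hypothetical name.

```shell
#!/bin/sh
# nic_drops: read `esxcli network nic stats get -n vmnicX` output on stdin
# and print the receive/transmit drop counters on one line.
# Assumed label text: "Receive packets dropped:" and "Transmit packets dropped:".
nic_drops() {
    awk -F': *' '
        /Receive packets dropped/  { rx = $2 }
        /Transmit packets dropped/ { tx = $2 }
        END {
            printf "rx_dropped=%s tx_dropped=%s\n",
                   (rx == "" ? "unknown" : rx), (tx == "" ? "unknown" : tx)
        }
    '
}

# Usage on an ESXi host:
#   esxcli network nic stats get -n vmnic2 | nic_drops
```

Running it twice a few minutes apart shows whether the counters are still rising, which matters more than their absolute value.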

🟥 M — Mitigate

Goal: Apply corrective or temporary actions to reduce impact or fix the issue.

✔ What to Do

Replace or evacuate failed disks.
Correct MTU mismatch or NIC config errors.
Update firmware/driver versions.
Rebalance or repair vSAN objects.
Temporarily migrate workloads away from affected host.

📘 Example

You replace the faulty NVMe device, triggering a vSAN resync to restore redundancy.

🔧 Useful CLI Commands

Shell
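# Evacuate a host before maintenance
esxcli system maintenanceMode set --enable true --vsanmode evacuateAllData

# Trigger proactive rebalance (RVC; newer releases also offer automatic rebalance in the vSphere Client)
vsan.proactive_rebalance --start <cluster>

# Check resync operations
esxcli vsan debug resync summary get
esxcli vsan debug resync list

# Remove a failed disk from a disk group
esxcli vsan storage remove -d <device_id>

# Re-check and refresh object state (RVC; -r is --refresh-state and does not rebuild data)
rvc administrator@vsphere.local@localhost
vsan.check_state -r <cluster>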
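Which data-evacuation mode to use depends on why the host is leaving service. The sketch below maps an intent to the `--vsanmode` value of `esxcli system maintenanceMode set` and prints the command instead of running it. The intent names (`reboot`, `decommission`, `none`) and the function name are hypothetical labels for this example.

```shell
#!/bin/sh
# maintenance_cmd INTENT: print (not run) the maintenance-mode command for a
# given intent. Intent names are hypothetical labels used only in this sketch.
maintenance_cmd() {
    case "$1" in
        reboot)       mode=ensureObjectAccessibility ;;  # short outage, keep objects reachable
        decommission) mode=evacuateAllData ;;            # host or disks leave the cluster
        none)         mode=noAction ;;                   # no data movement; objects may become inaccessible
        *)
            echo "usage: maintenance_cmd reboot|decommission|none" >&2
            return 2
            ;;
    esac
    echo "esxcli system maintenanceMode set --enable true --vsanmode ${mode}"
}

# Usage: review the printed command, then run it by hand on the host.
#   maintenance_cmd reboot
```

Printing rather than executing is a deliberate choice here: entering maintenance mode is disruptive, so a person should confirm the mode first.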

🟪 A — Acknowledge

Goal: Confirm resolution and document everything.

✔ What to Do

Re-check Skyline Health (all green).
Verify VM/application performance.
Confirm resync completion.
Update ticket with root cause + remediation.
Share lessons learned.

📘 Example

Post-mitigation checks show no vSAN latency, VMs are responsive, and Skyline Health is fully green.

🔧 Useful CLI Commands

Shell
# Confirm all vSAN components healthy
esxcli vsan health cluster list

# Verify no ongoing resyncs
esxcli vsan debug resync summary get

# Confirm all disks are in a normal state
esxcli vsan storage list

# Re-check host performance after fix
esxtop
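"Confirm resync completion" can be made a scripted check. The sketch below assumes the resync summary comes from `esxcli vsan debug resync summary get` and that its output contains a line labelled "Total Number Of Resyncing Objects"; both the label text and the function names are assumptions to verify on your build.

```shell
#!/bin/sh
# resync_objects_left: read `esxcli vsan debug resync summary get` output on
# stdin and print the number of objects still resyncing, or "unknown".
# Assumed label text: "Total Number Of Resyncing Objects:".
resync_objects_left() {
    awk -F': *' '
        /Total Number Of Resyncing Objects/ { print $2; found = 1 }
        END { if (!found) print "unknown" }
    '
}

# resync_done: succeed only when the parsed count is exactly 0.
resync_done() {
    [ "$(resync_objects_left)" = "0" ]
}

# Usage on an ESXi host:
#   esxcli vsan debug resync summary get | resync_done && echo "resync finished"
```

Treating an unparseable summary as "not done" keeps the check conservative: the incident is closed only on a positive confirmation.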


You said:

Include KB links for CLI commands

Copilot said:


Below is the PNOMA Troubleshooting Checklist expanded with examples, CLI commands, and KB/documentation links for each command category.
All CLI command references and examples are supported by VMware vSAN documentation and KB-equivalent sources.

The PNOMA framework definition (Problem → Notify → Observe → Mitigate → Acknowledge) is based on the VMware vSAN: Troubleshooting [V8] course.
[techdocs.b...oadcom.com], [dell.com], [vmwarearena.com] [koenig-solutions.com]

✅ PNOMA Troubleshooting Checklist + Examples + CLI + KB Links

🟦 P — Problem

Goal: Identify and scope the issue clearly.

✔ Example

Skyline Health shows vSAN disk group latency on Host-03 after a firmware change.

🔧 CLI Commands + VMware Documentation Links

Shell

# Check vSAN cluster health

esxcli vsan health cluster list

🔗 KB/Doc: VMware vSAN ESXCLI health commands
https://techdocs.broadcom.com/.../using-esxcli-commands-with-vsan.html
[techdocs.b...oadcom.com]

Shell

# List vSAN storage devices

esxcli vsan storage list

🔗 KB/Doc: vSAN Storage List command documentation
https://techdocs.broadcom.com/.../using-esxcli-commands-with-vsan.html
[techdocs.b...oadcom.com]

Shell

# Check physical disk details

esxcli storage core device list

🔗 Additional CLI catalog: ESXi Shell & ESXCLI commands
https://www.nakivo.com/blog/most-useful-esxcli-esxi-shell-commands-vmware-environment/
[nakivo.com]

Shell
# Scan vmkernel log for errors
cat /var/log/vmkernel.log | grep -i vsan

🔗 Disk failure CLI workflow
https://knowledge.broadcom.com/external/article/315532/...
[knowledge....oadcom.com]

🟩 N — Notify

Goal: Communicate early and formally.

✔ Example

"Possible storage degradation detected on vSAN cluster; DB teams may notice slow queries."

🔧 Helpful CLI Outputs for Notifications

Shell

# Hardware (IPMI) system event log

esxcli hardware ipmi sel list

🔗 General ESXi hardware CLI references
https://www.nakivo.com/blog/most-useful-esxcli-esxi-shell-commands-vmware-environment/
[nakivo.com]

Shell

# Recent critical vmkernel messages

tail -n 200 /var/log/vmkernel.log

🔗 Disk failure scenario (log patterns)
https://knowledge.broadcom.com/external/article/315532/... [knowledge....oadcom.com]

🟧 O — Observe

Goal: Collect deep diagnostics for root cause isolation.

✔ Example

NIC vmnic2 shows rising RX packet drops → indicates vSAN network congestion.

🔧 CLI Commands + KB/Documentation Links

Shell

# View vSAN network configuration

esxcli vsan network list

🔗 VMware vSAN ESXCLI network commands
https://techdocs.broadcom.com/.../using-esxcli-commands-with-vsan.html
[techdocs.b...oadcom.com]

Shell

# View MTU settings (physical NICs, then VMkernel interfaces)

esxcli network nic list

esxcli network ip interface list

🔗 ESXi Shell CLI reference
https://www.nakivo.com/blog/most-useful-esxcli-esxi-shell-commands-vmware-environment/
[nakivo.com]

Shell

# NIC statistics (drops/errors)

esxcli network nic stats get -n vmnicX

🔗 Same CLI catalog
https://www.nakivo.com/blog/most-useful-esxcli-esxi-shell-commands-vmware-environment/
[nakivo.com]

Shell

# Per-disk health and congestion, plus live vSAN performance counters

esxcli vsan debug disk list

vsantop

🔗 vSAN performance CLI definitions
https://techdocs.broadcom.com/.../using-esxcli-commands-with-vsan.html
[techdocs.b...oadcom.com]

Shell

# vSAN object health (RVC)

vsan.check_state <cluster>

🔗 VMware vSAN RVC command reference
https://www.cisco.com/.../ZertoVrepVsanA.pdf
[cisco.com]

Shell

# Cluster information (vSAN identity/state)

esxcli vsan cluster get

🔗 Dell VxRail vSAN CLI reference
https://www.dell.com/support/kbdoc/.../vxrail-useful-esxcli-vsan-and-object-health-commands
[dell.com]

🟥 M — Mitigate

Goal: Apply fixes or workarounds.

✔ Example

Replace degraded NVMe device → Trigger automatic vSAN resync.

🔧 CLI Commands + Documentation Links

Shell

# Place host in maintenance mode (recommended option)

esxcli system maintenanceMode set --enable true -m ensureObjectAccessibility

🔗 Disk group repair/replace KB
https://knowledge.broadcom.com/external/article/315532/...
[knowledge....oadcom.com]

Shell

# Trigger rebalance (RVC)

vsan.proactive_rebalance --start <cluster>

🔗 vSAN debug commands
https://techdocs.broadcom.com/.../using-esxcli-commands-with-vsan.html
[techdocs.b...oadcom.com]

Shell

# Monitor resync operations

esxcli vsan debug resync summary get

esxcli vsan debug resync list

🔗 Dell resync monitoring examples
https://www.dell.com/support/kbdoc/.../vxrail-useful-esxcli-vsan-and-object-health-commands
[dell.com]

Shell

# Remove failed disk from disk group

esxcli vsan storage remove -d <device_id>

🔗 Disk group management KB
https://knowledge.broadcom.com/external/article/315532/...
[knowledge....oadcom.com]

Shell

# Re-check and refresh object state (RVC; -r is --refresh-state, not a repair)

vsan.check_state -r <cluster>

🔗 RVC Commands Documentation
https://www.cisco.com/.../ZertoVrepVsanA.pdf
[cisco.com]

🟪 A — Acknowledge

Goal: Validate resolution and document permanently.

✔ Example

After network fix, vSAN network health turns green, no resyncs pending, VM performance restored.

🔧 CLI Commands + KB/Documentation Links

Shell

# Re-check overall vSAN cluster health

esxcli vsan health cluster list

🔗 vSAN health command reference
https://techdocs.broadcom.com/.../using-esxcli-commands-with-vsan.html
[techdocs.b...oadcom.com]

Shell

# Ensure no pending resync

esxcli vsan debug resync summary get

🔗 Dell vSAN resync CLI
https://www.dell.com/support/kbdoc/...
[dell.com]

Shell

# Confirm storage device normal state

esxcli vsan storage list

🔗 vSAN storage CLI
https://techdocs.broadcom.com/.../using-esxcli-commands-with-vsan.html
[techdocs.b...oadcom.com]

Shell

# Host performance revalidation (ESXTOP)

esxtop

🔗 ESXi Shell/CLI reference
https://www.nakivo.com/blog/most-useful-esxcli-esxi-shell-commands-vmware-environment/
[nakivo.com]


You said:

Add troubleshooting tips per step

Copilot said:


Below is the expanded PNOMA Troubleshooting Checklist, now including examples, CLI commands, KB/documentation links, and troubleshooting tips for each step.
The PNOMA framework definition is based on the VMware vSAN: Troubleshooting [V8] course.
All CLI references are sourced from recognized vSAN documentation. [qa.com] [knowledge....oadcom.com], [vmware.com], [en.wikipedia.org]

✅ PNOMA Troubleshooting Framework — Full Version

Examples + CLI Commands + KB Links + Practical Tips

🟦 P — Problem

🎯 Goal: Identify and define the issue clearly.

📘 Example

vSAN reports “High backend latency” on a disk group, impacting several VMs on one host.

🔧 CLI Commands + KB Links

Shell

esxcli vsan health cluster list

🔗 vSAN Health Commands [knowledge....oadcom.com]

Shell

esxcli vsan storage list

🔗 vSAN Storage Commands [knowledge....oadcom.com]

Shell

esxcli storage core device list

🔗 ESXi CLI Catalog [education.hpe.com]

Shell

cat /var/log/vmkernel.log | grep -i vsan

🔗 Disk Failure Log Patterns [spiretecso...utions.com]

🛠 Troubleshooting Tips

Always check recent changes first: firmware upgrades, NIC driver updates, new patching.
Look for patterns: Is the issue isolated to a host, disk group, network path, or VM?
Use vSAN Skyline Health to quickly distinguish hardware vs. configuration vs. performance issues.
If the problem is intermittent, compare metrics before/after peak loads.
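For the "recent changes" tip, one quick source is the VIB install dates on the host. The sketch below sorts the output of `esxcli software vib list` by install date. It assumes two header lines and an ISO-formatted install date in the fifth column, which should be confirmed against your ESXi version; `recent_vibs` is a hypothetical name.

```shell
#!/bin/sh
# recent_vibs [N]: read `esxcli software vib list` output on stdin and print
# the N most recently installed VIBs as "date name version" (default 10).
# Assumes two header lines and the install date (YYYY-MM-DD) in column 5.
recent_vibs() {
    count="${1:-10}"
    awk 'NR > 2 && $5 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { print $5, $1, $2 }' |
        sort -r |
        head -n "$count"
}

# Usage on an ESXi host:
#   esxcli software vib list | recent_vibs 5
```

A driver or firmware VIB installed shortly before the first alarm is worth checking before anything else.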

🟩 N — Notify

🎯 Goal: Ensure communication and tracking.

📘 Example

You alert the application team that degraded vSAN latency may slow database queries.

🔧 Helpful Commands for Notification Context

Shell

esxcli hardware ipmi sel list

🔗 ESXi Hardware CLI [education.hpe.com]

Shell

tail -n 200 /var/log/vmkernel.log

🔗 Disk Failure Log Reference [spiretecso...utions.com]

🛠 Troubleshooting Tips

Open a ticket early—even before mitigation—to ensure events are logged.
Share measured data, not assumptions (e.g., NIC drops count, disk latency values).
Clearly identify service impact vs. infrastructure symptoms.
Involve storage/network specialists early if symptoms cross domains.

🟧 O — Observe

🎯 Goal: Collect deep diagnostics and isolate cause.

📘 Example

Network stats show increasing packet drops on vmnic2—root cause is a faulty cable.

🔧 CLI Commands + KB Links

Shell

esxcli vsan network list

🔗 vSAN Network CLI Reference [knowledge....oadcom.com]

Shell

esxcli network nic list

esxcli network ip interface list

🔗 ESXi Shell MTU [education.hpe.com]

Shell

esxcli network nic stats get -n vmnicX

🔗 Same source as above [education.hpe.com]

Shell

esxcli vsan debug disk list

vsantop

🔗 vSAN Performance CLI [knowledge....oadcom.com]

Shell

esxcli vsan cluster get

🔗 VxRail vSAN Commands [vmware.com]

Shell

vsan.check_state <cluster>

🔗 RVC Commands [koenig-solutions.com]

🛠 Troubleshooting Tips

Compare the failing host's metrics to known-good hosts in the same cluster.
Check for network asymmetry (one host dropping packets, others clean).
Validate MTU consistency across the entire vSAN network path.
Look for slow disks using the latency values in vsantop and the congestion values in esxcli vsan debug disk list.
Use RVC for object-level insight when issues affect specific VMs.
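MTU consistency along the whole path can be tested by pinging each vSAN peer with the don't-fragment bit set. The sketch below wraps `vmkping` for that purpose. The interface name, the peer addresses, and the `PING_CMD` override are assumptions for illustration; the override exists only so the function can be dry-run away from a host.

```shell
#!/bin/sh
# check_mtu_path VMK SIZE PEER...: ping each peer with the don't-fragment bit
# set and a fixed payload size, print ok/FAIL per peer, and return the number
# of failing peers as the exit status.
# PING_CMD defaults to vmkping; overriding it is only meant for dry runs.
# A payload of 8972 bytes corresponds to a 9000-byte MTU minus 28 bytes of
# IP and ICMP headers.
check_mtu_path() {
    vmk="$1"
    size="$2"
    shift 2
    failed=0
    for peer in "$@"; do
        if ${PING_CMD:-vmkping} -I "$vmk" -s "$size" -d -c 3 "$peer" >/dev/null 2>&1; then
            echo "ok   ${peer}"
        else
            echo "FAIL ${peer}"
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

# Usage on an ESXi host (interface and addresses are placeholders):
#   check_mtu_path vmk1 8972 <peer_ip_1> <peer_ip_2>
```

If the large payload fails while a default-size ping succeeds, a device in the path is likely configured with a smaller MTU.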

🟥 M — Mitigate

🎯 Goal: Apply corrective action or temporary workaround.

📘 Example

You correct an MTU mismatch and network drops stop immediately.

🔧 CLI Commands + KB Links

Shell

esxcli system maintenanceMode set --enable true -m ensureObjectAccessibility

🔗 Disk Group Lifecycle KB [spiretecso...utions.com]

Shell

vsan.proactive_rebalance --start <cluster> # via RVC

🔗 vSAN Debug CLI [knowledge....oadcom.com]

Shell

esxcli vsan debug resync summary get

esxcli vsan debug resync list

🔗 Resync Commands [vmware.com]

Shell

esxcli vsan storage remove -d <device_id>

🔗 Disk Group Management KB [spiretecso...utions.com]

Shell

vsan.check_state -r <cluster>

🔗 RVC State Refresh Command (-r is --refresh-state) [koenig-solutions.com]

🛠 Troubleshooting Tips

Never replace hardware without confirming via logs + health checks.
Use maintenance mode ensureObjectAccessibility to avoid unnecessary resyncs.
If the issue is performance related, check disk balance and rebalance before escalating, since the cluster may be imbalanced.
If resyncing is slow, check for network bottlenecks first.
When removing disks, confirm dedupe/compression status (entire disk group may go down).
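The deduplication check in the last tip can be read from `esxcli vsan storage list` before a disk is removed. The sketch below looks up one device and prints its disk group and deduplication flag. The field labels it matches ("Device:", "VSAN Disk Group UUID:", "Deduplication:") are assumptions about the output format, and `disk_group_dedup` is a hypothetical name.

```shell
#!/bin/sh
# disk_group_dedup DEVICE: read `esxcli vsan storage list` output on stdin and
# print the disk group UUID and deduplication flag for one device.
# Assumed field labels: "Device:", "VSAN Disk Group UUID:", "Deduplication:".
disk_group_dedup() {
    awk -v want="$1" '
        # Strip everything up to and including the first ": " of a line
        function value(line) { sub(/^[^:]*: */, "", line); return line }
        /^ *Device:/            { dev = value($0) }
        /VSAN Disk Group UUID:/ { group[dev] = value($0) }
        /Deduplication:/        { dedup[dev] = value($0) }
        END {
            if (want in group) {
                printf "device=%s group=%s dedup=%s\n", want, group[want], dedup[want]
            } else {
                print "device not found: " want
                exit 1
            }
        }
    '
}

# Usage on an ESXi host:
#   esxcli vsan storage list | disk_group_dedup <device_id>
```

When the flag reads true, plan for the whole disk group to be affected rather than the single device.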

🟪 A — Acknowledge

🎯 Goal: Validate resolution + document findings.

📘 Example

All Skyline Health checks return green, and DB team confirms response times are normal.

🔧 CLI Commands + KB Links

Shell

esxcli vsan health cluster list

🔗 Cluster Health CLI [knowledge....oadcom.com]

Shell

esxcli vsan debug resync summary get

🔗 Resync Status [vmware.com]

Shell

esxcli vsan storage list

🔗 vSAN Storage CLI [knowledge....oadcom.com]

Shell

esxtop

🔗 ESXCLI/ESXi Shell Guide [education.hpe.com]

🛠 Troubleshooting Tips

Confirm no remaining resyncs, even small ones.
Validate storage policy compliance across all VM objects.
Gather “before vs after” metrics to demonstrate improvement.
Update the incident with:
- Root cause
- Actions taken
- Prevention recommendations
Schedule follow-up checks for potential regression.
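The "before vs after" tip is simplest when the same command output is saved at both points and compared. This is a minimal sketch with hypothetical helper names (`snapshot`, `compare_snapshots`) and an arbitrary directory layout; it assumes `diff` is available in the shell where it runs.

```shell
#!/bin/sh
# snapshot LABEL DIR CMD...: run CMD and save its output as DIR/LABEL.txt.
# Hypothetical helper; LABEL is expected to be "before" or "after".
snapshot() {
    label="$1"
    dir="$2"
    shift 2
    mkdir -p "$dir"
    "$@" > "${dir}/${label}.txt" 2>&1
}

# compare_snapshots DIR: show what changed between before.txt and after.txt.
# Exit status follows diff: 0 when identical, non-zero when they differ.
compare_snapshots() {
    diff "${1}/before.txt" "${1}/after.txt"
}

# Usage on an ESXi host (directory name is a placeholder):
#   snapshot before /tmp/incident esxcli network nic stats get -n vmnic2
#   ... apply the fix ...
#   snapshot after /tmp/incident esxcli network nic stats get -n vmnic2
#   compare_snapshots /tmp/incident
```

The diff output can be attached to the ticket as evidence of the improvement.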


You said:

Create a one-page quick guide

Copilot said:


Below is a clean, one‑page quick guide summarizing the PNOMA troubleshooting model with examples, CLI commands, and key tips—all compressed into a practical operations-ready reference.
Framework definition is based on the VMware vSAN: Troubleshooting [V8] course, and CLI references come from VMware vSAN ESXCLI documentation. [qa.com] [knowledge....oadcom.com], [vmware.com], [en.wikipedia.org]

PNOMA Troubleshooting – One‑Page Quick Guide

A rapid workflow for diagnosing and resolving issues in VMware vSAN environments.

P — Problem

Goal: Identify what’s wrong and where.

Checklist:

Review Skyline Health and vSphere alarms.
Identify affected hosts, disks, objects, or VMs.
Check logs for symptom patterns.

Example:

vSAN reports high backend latency on Host-03.

CLI:

Shell
esxcli vsan health cluster list # vSAN health status
esxcli vsan storage list # vSAN disks/disk groups
cat /var/log/vmkernel.log | grep -i vsan
[https://knowledge.broadcom.com/external/article/326549/glossary-of-vsan-terms-and-acronyms.html], [https://spiretecsolutions.com/uploads/course/17545533991550858056.pdf]

Tips:

Look for recent config/firmware changes.
Compare failing host to a healthy host.

N — Notify

Goal: Alert the right people early.

Checklist:

Inform Ops/SRE, app owners.
Open ticket with initial symptoms.

Example:

Notify DB team: “Disk latency may slow queries.”

CLI (for supporting data):

Shell

esxcli hardware ipmi sel list

tail -n 200 /var/log/vmkernel.log

[https://education.hpe.com/ww/en/training/docs/cds/h9tf7s.pdf], [https://spiretecsolutions.com/uploads/course/17545533991550858056.pdf]

Tips:

Provide facts (latency values, NIC drops), not assumptions.

O — Observe

Goal: Gather deep diagnostics.

Checklist:

Check network health (MTU, drops).
Review disk latency, congestion, and object states.
Compare metrics across hosts.

Example:

vmnic2 shows increasing RX drops → cable/MTU issue.

CLI:

Shell

esxcli vsan network list

esxcli network nic stats get -n vmnicX

vsantop # live vSAN performance counters

esxcli vsan debug disk list # per-disk health and congestion

vsan.check_state <cluster> # via RVC

[knowledge....oadcom.com], [education.hpe.com], [koenig-solutions.com]

Tips:

Validate MTU consistency end-to-end.
Use RVC for object-level insight.

M — Mitigate

Goal: Apply fixes/workarounds.

Checklist:

Fix failed disks/NICs/cables.
Correct MTU mismatch.
Rebalance or repair vSAN objects.

Example:

Replace degraded NVMe → resync triggers automatically.

CLI:

Shell

esxcli system maintenanceMode set --enable true -m ensureObjectAccessibility

vsan.proactive_rebalance --start <cluster> # via RVC

esxcli vsan debug resync summary get

esxcli vsan storage remove -d <device_id>

[spiretecso...utions.com], [knowledge....oadcom.com], [vmware.com]

Tips:

Use "ensureObjectAccessibility" to avoid unnecessary migrations.
Fix network issues before focusing on disks—they often mimic disk problems.

A — Acknowledge

Goal: Confirm resolution + close out.

Checklist:

Validate all green health checks.
Confirm apps operate normally.
Document root cause & fix.

Example:

Skyline Health fully green; DB performance normal.

CLI:

Shell

esxcli vsan health cluster list

esxcli vsan debug resync summary get

esxcli vsan storage list

esxtop

[knowledge....oadcom.com], [vmware.com], [education.hpe.com]

Tips:

Ensure zero pending resyncs before declaring victory.
Capture before/after metrics for documentation.
