
CLI madness part I – performance health check for ESX in 5 minutes

Quite a few customers and partners have asked about quick ways to do a live performance health check of VMware ESX environments.  The short answer is to evaluate the vCenter Operations Suite!  It is a very intuitive tool that enables VI admins to pinpoint the culprits causing issues in a given infrastructure.  For customers who are more CLI driven and do not yet have the luxury of purchasing vCenter Operations (I am always a big fan of the command line, as it is geeky, fast, and fun!), there is always esxtop for quick spot checks.  Below is a quickie on how to use esxtop to do a 5 minute end-to-end health check of your ESX environment (compute, network & storage).  Part II of this blog will add more color to storage side data analysis for the Nimble array.
To get started, establish an SSH session to the ESX host of interest, then issue the ‘esxtop’ command.
NOTE: It is always good practice to change the default display interval to 2 seconds for a more frequent data refresh (the default value is 5).  To change it, type “s”; at the prompt for a value in seconds, type “2” and press ENTER.
The initial screen shows CPU usage across the board.  Type the following keys to switch between the CPU, memory, network, disk adapter, LUN, and VM virtual disk performance stats:
  • c – CPU
  • m – memory
  • n – network
  • d – disk adapter (in our example screenshots below, it’ll be vmhba33/37 depending on which one represents the sw/iscsi adapter)
  • u – all LUNs
  • v – virtual machine disks (VMDKs)
Quick spot check for CPU usage:
Type ‘c’ to switch to CPU view:

Pay attention to the “%RDY” time for all the VMs.  If the value is greater than 10%, the ESX server is CPU bound, meaning a virtual machine vCPU was ready to execute threads but the VMkernel was not able to schedule it on a physical CPU (PCPU).  For example, if VM A has a %RDY of 10, then 10% of the time it was waiting for the VMkernel to schedule it onto a physical CPU so it could execute instructions.
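As a quick illustration of the 10% rule of thumb, the sketch below scans per-VM %RDY samples and flags the constrained ones.  The file and values are made up for illustration, not real esxtop output:

```shell
# Hypothetical per-VM %RDY samples (vm,%RDY); not real esxtop output
cat > /tmp/rdy_sample.csv <<'EOF'
vm-a,3.2
vm-b,14.7
vm-c,9.8
EOF

# Flag any VM over the 10% ready-time rule of thumb
awk -F, '$2 > 10 { print $1 " looks CPU constrained (%RDY=" $2 ")" }' /tmp/rdy_sample.csv
```

Here only vm-b crosses the threshold and gets flagged.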

Quick spot check for memory usage:
Type ‘m’ to switch to memory view

Always check to see how much free memory is available.  More importantly, see whether the VMkernel is swapping.  Typically, if the VMkernel is swapping to disk, the ESX server is memory bound, potentially due to too much memory overcommitment.
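The swap check above boils down to a simple rule: sustained non-zero swap-in/swap-out rates mean the host is actively swapping.  A minimal sketch, with hypothetical rates rather than real esxtop fields:

```shell
# Hypothetical swap-in/swap-out rates (MB/s) read off the memory screen;
# not real esxtop output. Any sustained non-zero rate means active swapping.
swap_in_mbs=3.2
swap_out_mbs=0.0
awk -v si="$swap_in_mbs" -v so="$swap_out_mbs" 'BEGIN {
    if (si > 0 || so > 0) print "host is swapping: check memory overcommitment"
    else                  print "no active swapping"
}'
```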

Quick spot check for Network usage:
Type ‘n’ to switch to network view

For Nimble Storage, it is especially important to identify the vmkernel ports here.  They are typically ‘vmk1’ and ‘vmk2’, as ‘vmk0’ is used by the ESX management port.  For both vmk1 and vmk2, pay attention to “%DRPTX” and “%DRPRX” to see what percentage of packets are dropped during transmit or receive.  If the ESX host is pegged with high CPU utilization, there is a good chance of a high packet drop percentage.  The throughput numbers are also useful to see how much traffic is being transmitted, both in packets per second and in Mb/s.
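The drop percentage is just dropped packets over total packets attempted.  A quick sketch with hypothetical counters:

```shell
# Hypothetical vmkernel-port packet counters; not real esxtop output.
# Drop percentage = dropped / (dropped + transmitted) * 100
dropped_tx=25
sent_tx=9975
awk -v d="$dropped_tx" -v s="$sent_tx" 'BEGIN { printf "%%DRPTX = %.2f%%\n", d * 100 / (d + s) }'
```

With these numbers, 25 drops out of 10,000 attempts works out to 0.25%.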

Quick spot check for Storage usage:
Keys to remember:
d – vmhba stats view

v – all virtual machine view

If a given VM has multiple VMDKs, type “e” to expand the view to include all of them; after typing ‘e’, esxtop will prompt you to enter the GID of the VM of interest, so simply type in the corresponding GID to see stats for all of its VMDKs:


u – all datastores view

For ‘d’ (vmhba stats), be sure to identify the vmhba # of the sw/iscsi adapter (you can find this either in vCenter Server or from the CLI by typing ‘#esxcli iscsi adapter list’).  While in this view, type ‘f’ to add additional fields to display, then type ‘d’, ‘h’, ‘i’, where ‘d’ adds queue stats, ‘h’ adds read latency stats and ‘i’ adds write latency stats.
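To pick the sw/iscsi vmhba out of the adapter list, look for the adapter using the iscsi_vmk driver.  The sketch below runs against a captured, illustrative sample of what ‘esxcli iscsi adapter list’ prints (adapter name and layout are hypothetical):

```shell
# Illustrative sample of 'esxcli iscsi adapter list' output; on a live
# host you would run the command directly instead of using this file.
cat > /tmp/iscsi_adapters.txt <<'EOF'
Adapter  Driver     State   UID            Description
-------  ---------  ------  -------------  ----------------------
vmhba37  iscsi_vmk  online  iscsi.vmhba37  iSCSI Software Adapter
EOF

# The software iSCSI adapter is the one backed by the iscsi_vmk driver
awk '$2 == "iscsi_vmk" { print $1 }' /tmp/iscsi_adapters.txt
```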


In this view, pay close attention to the following fields:
  • MBREAD/s (read throughput)
  • MBWRITE/s (write throughput)
  • DAVG/cmd (storage device latency per SCSI command, in ms)
  • KAVG/cmd (kernel latency per SCSI command, in ms)
  • DAVG/rd (storage device latency for read IO, in ms)
  • DAVG/wr (storage device latency for write IO, in ms)
  • KAVG/rd (kernel latency for read IO, in ms)
  • KAVG/wr (kernel latency for write IO, in ms)
The throughput and “DAVG/cmd” values should match what you see in the Nimble performance monitoring UI (or stats output).
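A common way to triage these counters: DAVG reflects the array and fabric, while KAVG reflects time spent in the VMkernel (usually queuing).  The thresholds below (~2 ms for KAVG, ~20 ms for DAVG) are widely used rules of thumb, and the sample values are hypothetical:

```shell
# Hypothetical DAVG/KAVG samples (ms) for a quick latency triage
davg=1.8
kavg=4.2
awk -v d="$davg" -v k="$kavg" 'BEGIN {
    if (k >= 2)           print "high KAVG: kernel/queueing delay on the host"
    if (d >= 20)          print "high DAVG: look at the array or fabric"
    if (k < 2 && d < 20)  print "latency looks healthy"
}'
```

In this sample the device latency is fine but the kernel latency is not, pointing at host-side queuing rather than the array.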

esxtop in batch mode
If you need to drill down on an individual volume/datastore, switch to the datastore view by pressing ‘u’.  The tricky part here is that each device is displayed by the NAA ID of the volume rather than the datastore name.
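One way to map those NAA IDs back to datastore names is ‘esxcli storage vmfs extent list’, which lists each VMFS volume next to its backing device.  The sketch below parses an illustrative sample of that output (datastore names, UUIDs, and NAA IDs are all made up):

```shell
# Illustrative sample of 'esxcli storage vmfs extent list' output;
# on a live host, run the command directly instead of using this file.
cat > /tmp/extent_list.txt <<'EOF'
Volume Name  VMFS UUID  Extent Number  Device Name                           Partition
-----------  ---------  -------------  ------------------------------------  ---------
nimble-ds01  4f3c...    0              naa.2290b61c61b2d1236c9ce900e4d78efa  1
nimble-ds02  5a1d...    0              naa.2290b61c61b2d1236c9ce900fa12bc33  1
EOF

# Print datastore-name -> NAA-ID pairs (skip the two header lines)
awk 'NR > 2 { print $1 " -> " $4 }' /tmp/extent_list.txt
```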
All of the above is useful for a live health check.  If you want to collect the stats in batch mode from each ESX host for deeper analysis, follow the steps below:
First, run esxtop in batch mode
# cd /tmp
# esxtop -b -d 2 > esxtopresult.csv
The above command runs esxtop in batch mode for an unspecified amount of time, which is useful when the customer is not sure how long a given performance test will take.  If the customer knows ahead of time how long their repro/performance test runs, use the -n flag to specify the number of iterations esxtop should run.  For example, for a 10 minute test the command is the following (10 minutes is 600 seconds, and the command collects stats every 2 seconds, so 600/2 yields 300):
# esxtop -b -d 2 -n 300 > esxtopresult.csv
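The iteration count is just the test duration divided by the sample interval, which you can compute inline:

```shell
# -n iterations = test duration / sample interval (-d)
duration_s=600   # 10 minute test
interval_s=2     # esxtop -d value
echo $(( duration_s / interval_s ))
```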

Now that the esxtopresult.csv file has been generated, have the customer send you the file.  Once it arrives, the following tools are best for analyzing the results:
  • Windows Perfmon
  • MS Excel (since the results are in .CSV format)

Perfmon method
Launch Performance Monitor on a Windows machine and hit “CTRL-L”.
Specify the source data file (in this case, the .CSV file that you collected)
If the stats were collected over a long period of time, you could click on “Time Range” to narrow down the time range.
Next, click on the “Data” tab to add counters of interest – there are lots of counters available.  For us Nimble folks, the storage stats are typically the ones of interest:

Select “Physical Disk” as the counter of interest and expand the available sub-counters.  Of all the ones listed, the common ones to use are:
Latency:
Average Guest MilliSec/Command (this is the latency as observed by the virtual machine guest operating system; it is the average kernel latency + driver latency + storage array latency)
Average Kernel MilliSec/Command
Average Driver MilliSec/Command
Throughput:
Mbytes Write/sec
Mbytes Read/sec
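The guest-latency decomposition above can be sanity-checked with simple arithmetic; the values below are hypothetical:

```shell
# Guest-observed latency ~ kernel + driver + array latency (ms, hypothetical)
kernel_ms=0.4
driver_ms=0.2
array_ms=1.9
awk -v k="$kernel_ms" -v d="$driver_ms" -v a="$array_ms" \
    'BEGIN { printf "guest latency ~ %.1f ms\n", k + d + a }'
```

If the guest counter reads much higher than the sum of the other three, something between the stack layers deserves a closer look.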
Once the counters have been selected, choose the instances to display them for; in our example here, that would be all the datastores on the ESX host:

Since most Nimble deployments use iSCSI storage, it is a good idea to select all the instances that belong to the sw/iscsi vmhba.  In my example here, vmhba37 is the sw/iscsi adapter, so all the datastores belonging to it are selected:

By default, perfmon displays all the stats in a color coded graph, with each color representing an instance object (in this case, a datastore) and the counters associated with it (in this case, the average latency and throughput numbers).  If the graph starts to get busy, uncheck datastores from the list, or limit the counters to just latency or just throughput.

If you have a thing for MS Excel, check out this blog post on how to create fancy charts from Excel with the esxtop CSV file.
References (these are great references if you really want to dive deeper):
http://communities.vmware.com/docs/DOC-9279
http://www.yellow-bricks.com/esxtop/
http://kb.vmware.com/kb/1008205
