A look at the ESX I/O stack
Given the proliferation of ESX Server, I think it's worth looking into the ESX I/O stack: how I/Os get queued and processed until they reach the HBA driver queue. Then we'll look at an example of excessive I/O queuing on a LUN and the effect of increasing the LUN queue depth.
- When an application residing inside a Guest OS issues an I/O, that I/O gets queued to the Guest OS's Virtual Adapter driver.
- The Virtual Adapter driver then passes the I/O to the LSILogic/BusLogic emulator.
- The LSILogic/BusLogic emulator queues the I/O to the VMkernel's Virtual SCSI layer. Depending on the configuration, the I/O either passes directly to the SCSI layer or passes through the VMFS file system before it gets to the SCSI layer.
- Regardless of the path followed in the previous step, all I/Os ultimately end up at the SCSI layer.
- I/Os are then sent to the Host Bus Adapter driver queue. From there, I/Os hit the disk array's write cache and finally the back-end disk.
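The journey above can be written down as a short Python sketch. It is only an illustration: the function name `io_path` and the `via_vmfs` flag are hypothetical, and the stage names are just the layers listed above, not identifiers taken from ESX.

```python
from typing import List


def io_path(via_vmfs: bool) -> List[str]:
    """Return the ordered layers an I/O crosses, following the list above."""
    path = [
        "Guest OS virtual adapter driver",
        "LSILogic/BusLogic emulator",
        "VMkernel virtual SCSI layer",
    ]
    # Depending on the configuration, the I/O may pass through VMFS first.
    if via_vmfs:
        path.append("VMFS file system")
    # Either way, every I/O ends up at the SCSI layer and then the HBA queue.
    path += [
        "SCSI layer",
        "HBA driver queue",
        "Disk array write cache",
        "Back-end disk",
    ]
    return path


if __name__ == "__main__":
    for hop in io_path(via_vmfs=True):
        print(hop)
```

Each entry in the returned list is a place where an I/O can wait, which is why queuing has to be checked at more than one level.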
So looking at the above journey, we can quickly observe that:
- Performance can be affected at various queuing stages: by queuing at the VM level, at the VMkernel level, and at the HBA driver level.
- While many VMs can share the same VMFS datastore and each of them can queue I/Os to it, the underlying LUN only accepts as many outstanding commands as the per-LUN queue depth setting allows.
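To make the second point concrete, here is a minimal Python sketch of how a per-LUN queue depth caps the commands in flight. It is a simplified model, not how the VMkernel is implemented: the function name `split_commands`, the VM names, and the per-VM command counts are all invented for illustration.

```python
from typing import Dict, Tuple


def split_commands(outstanding_per_vm: Dict[str, int],
                   lun_queue_depth: int) -> Tuple[int, int]:
    """Split the commands aimed at one LUN into (active, queued).

    Simplified model: up to lun_queue_depth commands are active on the
    LUN; anything beyond that waits in the VMkernel queue.
    """
    total = sum(outstanding_per_vm.values())
    active = min(total, lun_queue_depth)
    queued = total - active
    return active, queued


if __name__ == "__main__":
    # Hypothetical VMs sharing one datastore backed by a single LUN.
    vms = {"vm_a": 20, "vm_b": 25, "vm_c": 18}
    print(split_commands(vms, lun_queue_depth=32))
    print(split_commands(vms, lun_queue_depth=64))
```

With a depth of 32 the model leaves commands waiting in the queue; with a depth of 64 the same workload fits entirely within the active slots.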
So let's see how we can identify potential queuing issues using esxtop.
Figure 1.
Press "d" for Disk Statistics:
Figure 2.
Look at the "LOAD" entry above. This entry is the ratio of VMkernel active commands plus VMkernel queued commands to the adapter queue depth. In the above example almost everything appears normal except the QUED column: there are too many queued commands in the VMkernel. So let's dig a little deeper and take a look at a specific VM and the LUN it resides on, because something doesn't look right.
We'll run esxtop and select the GID for the VM Win2k3_B.
Figure 3.
Press "e" to expand and enter the GID.
Figure 4.
Press "d" to display disk statistics and expand with "e", "a", "t", "l", in this order.
Figure 5.
From the above it looks like the "LOAD" has hit the roof and the %USD is excessive. %USD is the percentage of the LUN queue depth used by active commands. LOAD, in this case, is the ratio of active plus queued commands in the VMkernel to the LUN queue depth. The current value in the above example is 1.97. This value should be less than 1. Also keep an eye on the READS/s and QUED values. A non-zero QUED value almost always indicates a storage bottleneck.
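The two metrics can be reproduced with a few lines of Python. This is a sketch based on the definitions given above, not esxtop's actual source; the function names `lun_load` and `pct_used` are hypothetical. The inputs (32 active commands, 31 queued commands, queue depth 32) are the values this post reports for the LUN before any tuning.

```python
def lun_load(active: int, queued: int, queue_depth: int) -> float:
    """LOAD: (active + queued VMkernel commands) / LUN queue depth."""
    return (active + queued) / queue_depth


def pct_used(active: int, queue_depth: int) -> float:
    """%USD: percentage of the LUN queue depth used by active commands."""
    return 100.0 * active / queue_depth


if __name__ == "__main__":
    # Values reported in this post for the LUN at the default depth of 32.
    print(round(lun_load(active=32, queued=31, queue_depth=32), 2))
    print(pct_used(active=32, queue_depth=32))
```

A LOAD above 1 means more commands are outstanding than the LUN queue can hold, so the excess has to wait in the VMkernel.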
So it looks like the LUN queue depth needs to be increased from the default of 32 outstanding I/Os per LUN to 64. After making the change and rebooting the host, I can verify the new queue depth by looking at the output of "cat /proc/scsi/qla2300/0". For a single-ported HBA the port is 0. For a dual-ported HBA, look at both 0 and 1.
Figure 6.
After verifying the new LUN queue depth I'm now ready to re-run my previous test.
Figure 7.
Using esxtop and following the same steps as in Figures 4 & 5, I now see that the LOAD has decreased by 50% compared to the results in Figure 5 and is now below 1. You will also notice that the READS/s have increased substantially, by roughly 37%.
Additionally, by increasing the LUN queue depth, we were able to:
- Increase the number of active commands in the VMkernel for the LUN (from 32 to 63).
- Decrease the number of queued VMkernel commands for the LUN (from 31 to 0). A non-zero QUED value almost always contributes to excessive latency and poor performance. The idea is to process as many commands as possible without leaving them in the queue waiting to be serviced.
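As a quick sanity check on the numbers above, the following Python sketch recomputes LOAD before and after the change and the relative difference between the two. It assumes LOAD is (active + queued) / LUN queue depth, as described earlier; the helper names `load` and `percent_change` are hypothetical.

```python
def load(active: int, queued: int, queue_depth: int) -> float:
    """LOAD as described in this post: (active + queued) / LUN queue depth."""
    return (active + queued) / queue_depth


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    before = load(active=32, queued=31, queue_depth=32)  # default depth
    after = load(active=63, queued=0, queue_depth=64)    # increased depth
    print(f"LOAD before: {before:.2f}")
    print(f"LOAD after:  {after:.2f}")
    print(f"Change:      {percent_change(before, after):.0f}%")
```

The same number of commands is outstanding in both cases (63); doubling the queue depth simply moves them all from the queue into the active slots, which halves LOAD.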

