Wednesday, September 19, 2012

VMware ESXi 5.0 Update 1 Sending Traffic out Unused VMNIC because of Failback

We've had some trouble with our EqualLogic SAN performance, which led us to take a hard look at the iSCSI/EqualLogic best practices. This post, along with a few others, will cover what we've learned. I wish I could say the support call to the Dell EqualLogic team has been really useful, but with the exception of one tech named "Chris", who shared some genuinely useful information on the EqualLogic side of things, we've been on our own for the last 3 weeks.

One of the problems I noticed and shared with the EqualLogic team was that, after following EqualLogic's "Configuring and Installing the EqualLogic Multipathing Extension Module for VMware vSphere and PS Series SANs" guide to MPIO for iSCSI targets, traffic on the ESXi host was going out of the wrong vmnic.

The Environment:

  • ESXi Host 5.0.0 build 768111 fully patched as of 9-16-2012
  • Dell EqualLogic PS6100XS with firmware 5.2.5
  • Using EqualLogic-ESX-Multipathing-Module v1.1.1

The Problem

If you follow the install guide for the EqualLogic-ESX-Multipathing-Module and use the setup.pl script to configure a vSwitch or Distributed Switch (vDS), you'll create two iSCSI vmks and a storage heartbeat vmk.
vSwitch Setup for just iSCSI
On each of those iSCSI networks you configure NIC Teaming and Failover to use only one of the physical adapters and mark the other as unused, alternating which adapter is marked unused on the other iSCSI network.


The Dell setup.pl script sets Failback to No on each iSCSI network. So, following these best practices in my setup, I would expect vmk2 traffic to leave only through vmnic2, since vmnic3 is set to Unused. And vice versa: vmk3 traffic should leave only through vmnic3, since vmnic2 is marked Unused for it.


However, when I SSH to the ESXi host, run "esxtop", and hit "n" for the network view, I see that vmk2 is in fact using vmnic3, which is supposed to be "unused". I pointed this out to Dell and they said it was odd but didn't have any answers, and they never followed up with me on it. I was asked for the VMware support logs but was never even asked to recreate the problem.



The Fix

I set this up every way I could think of: using both vSwitches and Distributed Switches, rebuilding my hosts from CD, trying different ESX hosts with different hardware, and trying hosts on different network infrastructure. Every attempt had the same issue of traffic using the incorrect NIC.

After all, I had used the script to create the setup to ensure it was correct and followed best practice. So I created it by hand instead of using the script and double-checked every setting to ensure everything matched. Still no luck. Then, of course after it was already time to go home, I checked esxtop one last time on an ESX host I hadn't finished configuring, and behold, it was working correctly. Each vmk was bound to its correct vmnic.

vmk using the correct vmnic
What I hadn't done yet was set Failback to No. Everything else was done.


The Reason this Happened
After doing some reading, I found this epic post by Joshua Townsend that laid out the recent changes around VMware iSCSI networking. In fact, his quick fix at the time was to set Failback to No. Looking at the EqualLogic-ESX-Multipathing-Module v1.1.1 setup.pl, you can see it following Joshua's quick fix and setting the failback option. I also found the same thing in version 1.1.0. The script's comments even say that it's doing so because of a VMware bug. However, VMware says that bug is now fixed (VMware KB 2008144), and it looks like setting this option now introduces a bug instead.


So if you used the Dell EqualLogic-ESX-Multipathing-Module setup script or followed the install guide, you may want to check whether you have this problem, because network throughput, multipathing, and network redundancy may not work as you expect.
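
If you have more than a couple of hosts to check, it can help to write down what you expect and compare it with what esxtop shows. Below is a minimal Python sketch of that comparison. It does not talk to an ESXi host at all: the function name find_misrouted and both dictionaries are made up for illustration, and you would fill them in by hand from the NIC Teaming settings and the esxtop network view. The vmk2 values are the ones from this post; the vmk3 observation is an assumption.

```python
# Hypothetical helper: nothing here queries an ESXi host.
# Fill in both dictionaries by hand from the teaming policy and esxtop.

def find_misrouted(expected, observed):
    """Return (vmk, expected_vmnic, observed_vmnic) for every vmk whose
    observed uplink differs from the single active uplink in its policy."""
    problems = []
    for vmk in sorted(expected):
        seen = observed.get(vmk)
        # Skip vmks that have no observation recorded yet.
        if seen is not None and seen != expected[vmk]:
            problems.append((vmk, expected[vmk], seen))
    return problems


# The only active adapter configured for each iSCSI vmk in this post's setup.
expected = {"vmk2": "vmnic2", "vmk3": "vmnic3"}

# vmk2 -> vmnic3 is what esxtop showed in this post; the vmk3 entry is assumed.
observed = {"vmk2": "vmnic3", "vmk3": "vmnic3"}

for vmk, want, got in find_misrouted(expected, observed):
    print(f"{vmk}: expected {want}, but traffic is leaving via {got}")
```

With the values above, only vmk2 is reported, since it is leaving through vmnic3 even though vmnic2 is its one active adapter.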