After installing a pair of Virtualized Network Expansion Modules (X4238) in our Blade 6000 chassis, my ESX 3.5 Update 5 hosts randomly fail with the error "scsi: device might be offline - command error recovery failed: host 5 channel 0 id 0 lun 0." That address does not correspond to any LUNs on our Sun/QLogic FC HBAs (PCIe dual 4Gig). The old Dell blades sharing the FC LUNs are not experiencing any SCSI errors whatsoever. It appears to me that the error refers to the VNEM's SAS fabric, but we have no SAS drive modules in the chassis. Is the fact that each blade sees other hosts on this fabric causing issues? I'd like to disable the SAS fabric, but I don't see a way to do this via the BIOS or the VNEM ILOM. Do you think I'm barking up the wrong tree?
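For anyone trying to narrow down which adapter such messages point at, one option is to pull the host/channel/id/lun address out of each log line and tally the errors per SCSI host number. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the log lines follow the same wording as the message quoted above.

```python
import re
from collections import Counter

# Matches the address portion of messages worded like the one quoted above.
ADDRESS_RE = re.compile(
    r"host\s+(?P<host>\d+)\s+channel\s+(?P<channel>\d+)\s+"
    r"id\s+(?P<id>\d+)\s+lun\s+(?P<lun>\d+)",
    re.IGNORECASE,
)


def parse_scsi_address(line):
    """Return (host, channel, id, lun) as ints, or None if no address is found."""
    match = ADDRESS_RE.search(line)
    if match is None:
        return None
    return tuple(int(match.group(name)) for name in ("host", "channel", "id", "lun"))


def count_errors_by_host(lines):
    """Count 'device might be offline' messages per SCSI host number."""
    counts = Counter()
    for line in lines:
        if "device might be offline" not in line:
            continue
        address = parse_scsi_address(line)
        if address is not None:
            counts[address[0]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "scsi: device might be offline - command error recovery failed: "
        "host 5 channel 0 id 0 lun 0",
    ]
    print(count_errors_by_host(sample))
```

If every error lands on a single host number, that number can then be compared against the adapters the system actually enumerates to see which controller it belongs to.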
There are some confusing statements in your description.
If you are not using the SAS portion of the X4238 SB 6000 Virtualized 10GB Multi-Fabric Network Express Module, can I assume you have removed the REM from the X6250 blade?
There is no direct way to disable the SAS Expander Fabric on the NEM.
Sorry, no. We are not using the SAS features at all. That's what I wanted to test - if I could somehow "hide" this interface from the blades, that would be one fewer potential conflict. But it looks as if that's not a possibility.
We are using the REM for the host OS (VMware ESX 3.5 Update 5). I suppose we could do this without RAID, but I was hoping not to. Besides, the SCSI errors are not related to that HBA, near as I can tell.
The issue with removing the REM from the blades is that you will have no access to the internal blade SAS drives, so the blades won't boot unless they boot from an external source or from Compact Flash. Without the REM installed, the internal bridge chip only supports SATA drives, I'm afraid.
A further tip with the virtualized NEM: ensure that all blades within the chassis have the latest firmware, which includes the SP, BIOS, REM (Adaptec), and NEM firmware. Because this is a shared fabric, one down-rev blade can affect the rest, so make sure everything is at the latest level. This is documented in the NEM guides.
Yes, I had updated the blade firmware to the current version provided (the docs reference a newer version, but it is not on the download site). However, the problem seems related to interrupt sharing with the NXGE card/driver under ESX 3.5. I have an open case with Sun; they suspected a known issue with the aacraid driver (ESX350-201003402-BG), but that patch had already been installed. They're now asking me to open a case with VMware. In the meantime, I've tested a pair of our X6250s without the NXGE, installing the original quad-port GigE NIC (Intel) instead. Those hosts are very stable. It may not be the VNEM at all, but I'll keep you posted.
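As an aside on checking for interrupt sharing: on a Linux host, /proc/interrupts lists the devices attached to each IRQ, separated by commas when a line is shared. Below is a minimal sketch that flags shared IRQs in text of that format. The sample text and the driver names in it are invented for illustration and are not output from the hosts discussed here, and the sketch makes no assumption about how completely the ESX service console reports devices owned by the vmkernel.

```python
def find_shared_irqs(interrupts_text):
    """Return {irq_number: [device, ...]} for IRQs listing more than one device.

    Expects text laid out like Linux /proc/interrupts: a header row naming
    the CPUs, then one row per IRQ with per-CPU counts, the interrupt type,
    and a comma-separated device list.
    """
    lines = interrupts_text.strip("\n").splitlines()
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        irq, sep, rest = line.partition(":")
        irq = irq.strip()
        if not sep or not irq.isdigit():
            continue  # skip summary rows such as NMI, LOC, ERR
        tokens = rest.split()
        tail = " ".join(tokens[n_cpus:])  # interrupt type + device list
        parts = [part.strip() for part in tail.split(",")]
        if len(parts) < 2 or not parts[0].split():
            continue  # zero or one device on this IRQ
        first_device = parts[0].split()[-1]  # drop the interrupt type prefix
        shared[int(irq)] = [first_device] + parts[1:]
    return shared


# Hypothetical sample, NOT captured from a real system.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
 16:       4021       3977   IO-APIC-level  aacraid, nxge
 17:        880        912   IO-APIC-level  qla2300
NMI:          0          0
"""

if __name__ == "__main__":
    for irq, devices in sorted(find_shared_irqs(SAMPLE).items()):
        print(irq, devices)
```

On a real host the text would come from reading /proc/interrupts; an IRQ shared between the RAID controller and the NIC would be the kind of line worth including in a support case.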
I forgot to update this thread. It is, in fact, a bug in the NXGE driver (version 1.3.5 from Jan 2010). They are working on it. I had to remove this card and go back to a quad-port 1GigE Intel card. The SB6000 chassis has two 10GigE Virtualized NEMs, and we're using those for the vmkernel, so at least we get 10GigE on the storage (iSCSI to a Storage 7410 cluster). Not as bad as it could be.
Incidentally, I noticed a similar lockup under heavy traffic on an identical X6250 with the 10GigE NXGE card running Red Hat Enterprise Linux 5.5. Given the similarities between RHEL and the ESX 3.5 service console, it's likely the same problem.