We have 4 Blade6000 chassis with 6 to 7 X6450 blades each, and 2 ST6140 with 2 to 3 JBODs each. The X6450 blades have the Intel six-core Dunnington processors.
We installed ESX 3.5U4 on each X6450.
The blades run fine for 10 to 30 days before locking up completely. The hosts become unresponsive at random, and the only solution we have is to power the X6450 off and on again. There is no PSOD and no indication in any log that something has gone wrong. The machine simply stops responding. The issue is intermittent and random.
Actions taken, with no result:
- Upgrade all SB6000 and X6450 firmware and BIOS.
- Migrate from ESX 3.5U4 to vSphere U1.
- Increase the swap and the memory of the Service Console.
After a host hung while ESX was in maintenance mode, we can definitely rule out the performance theory.
After the upgrade to vSphere U1, it cannot be something linked to an ESX bug (we tried 3.5U4, 3.5U5 and now vSphere U1).
Now we are focusing on the hardware, especially the X6450 module. In the topic "VMware vSphere ESXi Lockup" I found exactly the same symptoms that we have; the only thing in common with his configuration is the X6450 six-core.
These are in multiple Blade6000 chassis in different data centers. Not all of our blades have experienced the problem, but it seems completely random across them. When the blades hang the console stops responding (black screen) and the only recourse is to reboot the system via the ILOM. Nothing special seems to be happening on the blades when they hang (no errors in /var/log/messages, etc) and it happens when they are idle or under load.
I have tickets open with Sun and Red Hat, but lacking any useful information in the logs they have not been able to provide much assistance. We have configured netdump on all these servers and are hoping to get something useful the next time one hangs.
I have visibility of the team actively working this issue within Oracle (Sun).
I can see the problem, how it is handled in Linux and in the Linux portion of VMware, and how that differs from Solaris.
We have observed the problem and see that both Linux and VMware pass most error handling to BIOS. BIOS then enters SMI mode to try to handle the communicated error and never exits SMI mode. This is why nothing is logged.
So in effect, the platform is not hanging but just not servicing any new operating system requests which would be observed as a lock up or hang.
Comparing this to Solaris we see the Solaris kernel handle all errors and dismiss what is irrelevant like PCIe training issues or log what's relevant like CE errors from memory DIMMs.
As Linux and VMware are not doing this, they do not report any log events.
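To make the difference between the two error paths concrete, here is a toy model of the behaviour described above. It is only a sketch of the reasoning, not real kernel or BIOS code: the `Platform` class, the function names and the event labels (`pcie_training`, `memory_ce`) are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Event names are illustrative labels, not real kernel or BIOS identifiers.
DISMISSED_BY_KERNEL = {"pcie_training"}


@dataclass
class Platform:
    os_log: List[str] = field(default_factory=list)
    responsive: bool = True


def kernel_handles(platform: Platform, event: str) -> None:
    """Solaris-style path: the kernel dismisses noise and logs the rest."""
    if event in DISMISSED_BY_KERNEL:
        return
    platform.os_log.append(event)


def bios_handles(platform: Platform, event: str, handler_returns: bool) -> None:
    """Linux/VMware-style path as described in this thread.

    The error is handed to BIOS. If the BIOS handler never returns,
    the platform stops servicing operating system requests.
    """
    if not handler_returns:
        platform.responsive = False
    # Either way, the operating system itself records nothing.


solaris_like = Platform()
kernel_handles(solaris_like, "pcie_training")  # dismissed
kernel_handles(solaris_like, "memory_ce")      # logged

linux_like = Platform()
bios_handles(linux_like, "memory_ce", handler_returns=False)

print(solaris_like)
print(linux_like)
```

In this model the Solaris-like platform ends up with one log entry and keeps running, while the Linux-like platform has an empty log and is marked unresponsive, which matches the symptom of a hang with nothing in the logs.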
So, the very first action to perform is to update the BIOS with the latest BIOS26 from the BIOS and ILOM package archive.
BIOS26 handles both Intel M20 and Intel M3 errors which are CE patrol scrubber and DIMM thermal trip errors.
After installing BIOS26, if you still see a hang, you will hopefully see a log entry in the service processor's event log.
If at any time you get an operating system error reported on screen, in a log or via an SNMP MIB, you are not experiencing the above issue: the fact that the error is reported shows that your platform is still servicing operating system jobs.
If your platform is totally unresponsive, you have the error handling issue caused by the operating system passing its errors to BIOS and BIOS trying to handle them itself.
We are developing a BIOS to diagnose the communicated errors from the operating system and handle or dismiss what is reported but at this time, that BIOS is unavailable.
For those not suffering a complete hang as described above, check your service processor's event log and the service processor's fault management daemon for a cause, as you may have a simple hardware defect like a memory error.
The fma environment and command line interface on the service processor will show all reported hardware errors and also defective or mapped-out parts.
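The triage rule given in the posts above can be summarised as a small decision function. This is only a sketch of that rule: the `Diagnosis` names and the `triage` function are hypothetical and are not part of any Sun, Oracle or VMware tool.

```python
from enum import Enum


class Diagnosis(Enum):
    # Host is totally unresponsive and the OS reported nothing:
    # the error was passed to BIOS and BIOS never returned.
    BIOS_ERROR_HANDLING_HANG = "update to BIOS26, then watch the SP event log"
    # Anything else: look for an ordinary hardware defect.
    OTHER_FAULT = "check the SP event log and SP fault management"


def triage(totally_unresponsive: bool, os_reported_error: bool) -> Diagnosis:
    """Apply the rule from this thread to one hang or error report."""
    if os_reported_error:
        # An error on screen, in a log or via SNMP means the platform
        # is still servicing operating system jobs.
        return Diagnosis.OTHER_FAULT
    if totally_unresponsive:
        return Diagnosis.BIOS_ERROR_HANDLING_HANG
    return Diagnosis.OTHER_FAULT


# Example: black console, nothing in /var/log/messages.
result = triage(totally_unresponsive=True, os_reported_error=False)
print(result.name, "->", result.value)
```

The check on `os_reported_error` comes first on purpose, since any reported error rules out the BIOS error handling issue regardless of how the host otherwise behaves.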