I'm experiencing a very similar issue, but across multiple blade types and running Red Hat Enterprise Linux instead of ESX:
Blade Types: X6450 (mostly), X6250, X6270
Operating systems: RHEL 4.7 (x86_64), RHEL 4.8 (x86_64), RHEL 5.3 (i686), RHEL 5.4 (x86_64)
These are in multiple Blade6000 chassis in different data centers. Not all of our blades have experienced the problem, but it seems completely random across them. When the blades hang the console stops responding (black screen) and the only recourse is to reboot the system via the ILOM. Nothing special seems to be happening on the blades when they hang (no errors in /var/log/messages, etc) and it happens when they are idle or under load.
I have tickets open with Sun and Red Hat, but lacking any useful information in the logs they have not been able to provide much assistance. We have configured netdump on all these servers and are hoping to get something useful the next time one hangs.
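For anyone else wanting to set up netdump on RHEL 4/5 to try to capture a dump the next time a blade hangs, the client-side setup is roughly as follows (the collector address is a placeholder; you also need a host running the netdump-server package to receive the dumps):

```shell
# Install the netdump client (yum on RHEL 5; use up2date on RHEL 4)
yum install netdump

# Point the client at your netdump-server collector
# (192.168.1.10 is a placeholder for your own collector host)
#   edit /etc/sysconfig/netdump and set:
#   NETDUMPADDR=192.168.1.10

# Enable at boot and start the service
chkconfig netdump on
service netdump start
```

Note that if the hang really is the SMI issue described below, the kernel never gets a chance to run, so netdump may capture nothing either.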
I have visibility of the team actively working this issue within Oracle (Sun).
I can see how the problem is handled in Linux and in the Linux portion of VMware, and how that differs from Solaris.
We have observed that both Linux and VMware pass most error handling to the BIOS. The BIOS then enters SMI mode to try to handle the communicated error and never exits SMI mode. This is why nothing is logged.
So in effect the platform is not hanging; it is simply not servicing any new operating system requests, which is observed as a lock-up or hang.
Comparing this to Solaris, the Solaris kernel handles all errors itself: it dismisses what is irrelevant, like PCIe link-training issues, and logs what is relevant, like correctable errors (CEs) from memory DIMMs.
Because Linux and VMware do not do this, they report no log events.
So, the very first action to perform is update the BIOS with the latest BIOS26 from the BIOS and ILOM package archive.
BIOS26 handles both Intel M20 and Intel M3 errors, which are CE patrol-scrubber errors and DIMM thermal-trip errors respectively.
After installing BIOS26, if you still see a hang, you will hopefully see a log entry in the service processors event log.
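A quick way to check the service processor's event log after a hang is via the ILOM, either over SSH with the ILOM CLI or over IPMI from another host (hostnames here are placeholders):

```shell
# Log in to the blade's ILOM service processor
ssh root@<ilom-hostname>

# At the ILOM CLI prompt, list the SP event log
-> show /SP/logs/event/list

# Alternatively, read the SEL over IPMI from a management host
ipmitool -I lanplus -H <ilom-hostname> -U root sel list
```

Because the SP runs independently of the host CPUs, these logs survive even when the host is wedged in SMI mode.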
If at any time you get an operating system error reported on screen, in a log, or via an SNMP MIB, you are not experiencing the above issue: the very fact that the error is reported means your platform is still servicing operating system work.
If your platform is totally unresponsive, you have the error-handling issue caused by the operating system passing its errors to the BIOS and the BIOS trying to handle them itself.
We are developing a BIOS that will diagnose the errors communicated from the operating system and handle or dismiss what is reported, but at this time that BIOS is unavailable.
For those not suffering a complete hang as described above, check your service processor's event log and the service processor's fault management daemon for a cause, as you may have a simple hardware defect like a memory error.
The fma environment and command-line interface on the service processor will show all reported hardware errors as well as any defective or mapped-out parts.
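As a sketch of checking the fault management state from the SP (exact commands vary with ILOM version; this assumes an ILOM that provides the fault-management shell):

```shell
# Log in to the blade's ILOM service processor
ssh root@<ilom-hostname>

# Enter the fault-management shell on the SP
-> start /SP/faultmgmt/shell

# List currently diagnosed faults, including mapped-out components
faultmgmtsp> fmadm faulty

# Show the fault/error event history in detail
faultmgmtsp> fmdump -v
```

A clean `fmadm faulty` output alongside a hard hang points back toward the SMI error-handling issue rather than a failed component.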