I'm experiencing a very similar issue but across multiple blade types, and running Red Hat Enterprise instead of ESX:
Blade Types: X6450 (mostly), X6250, X6270
Operating systems: RHEL 4.7 (x86_64), RHEL 4.8 (x86_64), RHEL 5.3 (i686), RHEL 5.4 (x86_64)
These are in multiple Blade6000 chassis in different data centers. Not all of our blades have experienced the problem, but it seems completely random across them. When the blades hang the console stops responding (black screen) and the only recourse is to reboot the system via the ILOM. Nothing special seems to be happening on the blades when they hang (no errors in /var/log/messages, etc) and it happens when they are idle or under load.
I have tickets open with Sun and Red Hat, but lacking any useful information in the logs they have not been able to provide much assistance. We have configured netdump on all these servers and are hoping to get something useful the next time one hangs.
Has this issue been resolved?
I'm running a few X6270 sun blades with Fedora 12 libvirtd/kvm and they have been locking up intermittently for the last several
There's a topic on this very issue in the VMware forums: http://communities.vmware.com/message/1496206
Unfortunately no one has found a resolution.
Hi, I just read your posts regarding the issue. We too are having what looks like an identical issue. We run a Sun Blade 6000 chassis with X6250 blades, and the symptoms are identical to the letter.
I have logged calls with Sun and VMware to see if we can progress this issue. I'll keep you posted.
I have visibility of the team actively working this issue within Oracle (Sun).
I have been looking at the problem, at how it is handled in Linux and in the Linux portion of VMware, and at how that differs from Solaris.
We have observed the problem and see that both Linux and VMware pass most error handling to the BIOS. The BIOS then enters SMI mode to try to handle the communicated error and never exits SMI mode. This is why nothing is logged.
So in effect, the platform is not hanging but just not servicing any new operating system requests which would be observed as a lock up or hang.
By comparison, the Solaris kernel handles all errors itself, dismissing what is irrelevant (such as PCIe training issues) and logging what is relevant (such as CE errors from memory DIMMs).
As Linux and VMware do not do this, they do not report any log events.
So, the very first action to perform is to update the BIOS with the latest BIOS26 from the BIOS and ILOM package archive.
BIOS26 handles both Intel M20 and Intel M3 errors which are CE patrol scrubber and DIMM thermal trip errors.
After installing BIOS26, if you still see a hang, you will hopefully see a log entry in the service processor's event log.
If at any time you get an operating system error reported on screen, in a log or via an SNMP MIB, you are not experiencing the above issue: the fact that the error is reported shows your platform is still servicing operating system jobs.
If your platform is totally unresponsive, you have the error handling issue caused by the operating system passing its errors to the BIOS and the BIOS trying to handle them itself.
We are developing a BIOS to diagnose the communicated errors from the operating system and handle or dismiss what is reported but at this time, that BIOS is unavailable.
For those not suffering a complete hang as described above, check your service processor's event log and the service processor's fault management daemon for a cause, as you may have a simple hardware defect like a memory error.
The fma environment and command line interface on the service processor will show all reported hardware errors and also defective or mapped-out parts.
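For anyone trying to work out which case applies to them, the three checks above can be written down as a small decision function. This is only a sketch of the reasoning in this post, not a Sun/Oracle tool; the names `Diagnosis` and `triage` and the two yes/no inputs are made up for illustration.

```python
from enum import Enum


class Diagnosis(Enum):
    # Hypothetical labels for the three cases described in the post above.
    OS_ERROR = "OS reported an error: not the SMI issue, follow up on that error"
    SMI_HANG = "Total hang with nothing logged: update the BIOS first"
    CHECK_SP = "Not a complete hang: check the SP event log and fault management"


def triage(os_reported_error: bool, totally_unresponsive: bool) -> Diagnosis:
    """Classify a blade problem using the rules given in this thread.

    os_reported_error:    an error appeared on screen, in a log, or via SNMP
    totally_unresponsive: console is dead and only an ILOM reset recovers it
    """
    # An OS-level error report means the platform is still servicing the OS,
    # so this takes priority over everything else.
    if os_reported_error:
        return Diagnosis.OS_ERROR
    # No report and a completely dead platform matches the BIOS/SMI description.
    if totally_unresponsive:
        return Diagnosis.SMI_HANG
    # Anything else may be an ordinary hardware fault, such as a memory error.
    return Diagnosis.CHECK_SP


if __name__ == "__main__":
    # Black console, nothing in /var/log/messages, recovered via ILOM reset.
    print(triage(os_reported_error=False, totally_unresponsive=True).value)
```

The order of the checks matters: a reported operating system error rules out the SMI case even if the machine later appears to hang.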
Just to be clear for the listeners, the latest firmware for the X6450 is SW3.2 BIOS27/ILOM126.96.36.199.
Sorry, and thanks for the update. After checking the X6250 BIOS level, we are at 3B23 and the latest version is 3B24. I cannot see the 3B26 you are referring to?
Apart from entries made after the forced reboot, I have no logs in the ILOM to suggest any errors. The latest entry before the reboot is from 4 days earlier.
Here is the location of the latest firmware:
The link I just posted is for X6450 SW3.2 BIOS 3B27. The latest for the X6250 is 3B24.
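Since the BIOS levels differ per blade model, here is a rough way to check an installed level against the latest ones quoted in this thread. It is a sketch under assumptions: the table holds only the two values mentioned above (not verified against the download site), the version strings are assumed to always look like `3B` followed by a number, and `LATEST_BIOS`, `bios_build` and `is_behind` are made-up names.

```python
import re

# Latest BIOS levels as quoted in this thread only; treat them as assumptions.
LATEST_BIOS = {
    "X6450": "3B27",
    "X6250": "3B24",
}


def bios_build(version: str) -> int:
    """Return the numeric build from a string such as '3B23' (case-insensitive)."""
    match = re.fullmatch(r"3B(\d+)", version.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised BIOS version string: {version!r}")
    return int(match.group(1))


def is_behind(model: str, installed: str) -> bool:
    """True if the installed BIOS is older than the latest listed for the model."""
    return bios_build(installed) < bios_build(LATEST_BIOS[model])


if __name__ == "__main__":
    # The X6250 case from the post above: installed 3B23, latest 3B24.
    print(is_behind("X6250", "3B23"))
```

A model that is not in the table raises `KeyError`, which seems safer than guessing a level for it.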