11 Replies Latest reply on May 29, 2010 4:52 AM by 807557

    X6450 ESX intermittent lockup

      We have four Blade6000 chassis with six to seven X6450 blades each, and two ST6140 arrays with two to three JBODs each. The X6450 blades have Intel six-core Dunnington processors.
      We installed ESX 3.5U4 on each X6450.
      The blades run fine for 10 to 30 days and then lock up completely. Hosts become unresponsive at random, and the only solution we have is to power the X6450 off and on again. There is no PSOD and nothing in any log to indicate that something has gone wrong. The machine simply stops responding. The issue is intermittent and random.
      Actions taken, with no effect:
      - Upgraded all SB6000 and X6450 firmware and BIOS.
      - Migrated from ESX 3.5U4 to vSphere U1.
      - Increased the swap and the memory of the Service Console.
      After a host that had ESX in maintenance mode also hung, we can definitely rule out the theory that this is a performance problem.
      Since the hangs continued after the upgrade to vSphere U1, it can't be something linked to an ESX bug (we tried 3.5U4, 3.5U5 and now vSphere U1).
      We are now focusing on the hardware, especially the X6450 module. In the topic "VMware vSphere ESXi Lockup" I found exactly the same symptoms that we have; the only thing in common with that poster's configuration is the six-core X6450.

      Any help is appreciated.
        • 1. Re: X6450 ESX intermittent lockup
          I'm experiencing a very similar issue, but across multiple blade types and running Red Hat Enterprise Linux instead of ESX:

          Blade Types: X6450 (mostly), X6250, X6270
          Operating systems: RHEL 4.7 (x86_64), RHEL 4.8 (x86_64), RHEL 5.3 (i686), RHEL 5.4 (x86_64)

          These are in multiple Blade6000 chassis in different data centers. Not all of our blades have experienced the problem, but it seems completely random across them. When a blade hangs, the console stops responding (black screen) and the only recourse is to reboot the system via the ILOM. Nothing special seems to be happening on the blades when they hang (no errors in /var/log/messages, etc.), and it happens both when they are idle and when they are under load.

          I have tickets open with Sun and Red Hat, but lacking any useful information in the logs they have not been able to provide much assistance. We have configured netdump on all these servers and are hoping to get something useful the next time one hangs.
          • 2. X6450 ESX intermittent lockup
            Has this issue been resolved?

            I'm running a few X6270 Sun blades with Fedora 12 libvirtd/KVM and they have been locking up intermittently for the last several
            • 3. Re: X6450 ESX intermittent lockup
              There's a topic on this very issue in the VMware forums: http://communities.vmware.com/message/1496206

              Unfortunately no one has found a resolution.
              • 4. Re: X6450 ESX intermittent lockup
                Hi, I just read your posts regarding the issue. We too are having what we see as an identical issue. We run a Sun Blade 6000 chassis with X6250 blades, and the symptoms are identical to the letter.

                I have logged calls with Sun and VMware to see if we can progress this issue. I'll keep you posted.
                • 5. Re: X6450 ESX intermittent lockup
                  Hi Guys,

                  I have visibility of the team actively working this issue within Oracle (Sun).
                  I am seeing the problem, how it is handled in Linux and the Linux portion of VMware, and how that differs from Solaris.
                  We have observed the problem and see that both Linux and VMware pass most error handling to the BIOS. The BIOS then enters SMI mode to try to handle the communicated error and never exits SMI mode. This is why nothing is logged.
                  So in effect, the platform is not hanging; it is just not servicing any new operating system requests, which is observed as a lockup or hang.
                  By comparison, the Solaris kernel handles all errors itself, dismissing what is irrelevant (such as PCIe training issues) and logging what is relevant (such as CE errors from memory DIMMs).
                  As Linux and VMware are not doing this, they do not report any log events.
                  So, the very first action to perform is to update the BIOS to the latest BIOS26 from the BIOS and ILOM package archive.
                  BIOS26 handles both Intel M20 and Intel M3 errors, which are CE patrol scrubber and DIMM thermal trip errors.
                  After installing BIOS26, if you still see a hang, you will hopefully see a log entry in the service processor's event log.

                  If at any time you get an operating system error reported on screen, in a log or via an SNMP MIB, you are not experiencing the above issue: the fact that the error is reported shows that your platform is still servicing operating system jobs.
                  If your platform is totally unresponsive, you have the error handling issue caused by the operating system passing its errors to the BIOS and the BIOS trying to handle them itself.

                  We are developing a BIOS that will diagnose the errors communicated by the operating system and handle or dismiss what is reported, but at this time that BIOS is unavailable.
                  For those not suffering a complete hang as described above, check your service processor's event log and the service processor fault management daemon for a cause, as you may have a simple hardware defect such as a memory error.
                  The fma environment and command line interface on the service processor will show all reported hardware errors and also defective or mapped-out parts.
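
The triage rule described above can be written down as a small decision function. This is only a sketch in Python to make the logic explicit: the `HangReport` fields and the result labels are hypothetical names invented for illustration, not anything exposed by the ILOM, ESX, Linux or the BIOS, and the observations have to be collected by hand.

```python
from dataclasses import dataclass


@dataclass
class HangReport:
    """Observations noted after an incident (all field names are hypothetical)."""
    os_error_reported: bool     # error seen on screen, in an OS log, or via SNMP
    totally_unresponsive: bool  # host stopped responding and needed a power cycle
    sp_event_logged: bool       # service processor event log has a related entry


def triage(report: HangReport) -> str:
    """Classify an incident using the rule of thumb from the post above."""
    if report.os_error_reported:
        # The OS was still running well enough to report the error, so this
        # is not the BIOS error-handling hang; look for a hardware defect.
        return "not_bios_hang"
    if report.totally_unresponsive:
        if report.sp_event_logged:
            # A newer BIOS may leave an entry in the service processor log.
            return "bios_hang_logged"
        return "bios_hang_silent"
    return "inconclusive"


if __name__ == "__main__":
    # A host that froze with nothing logged anywhere.
    print(triage(HangReport(False, True, False)))
```

For a "not_bios_hang" result, the next step suggested in the post is the service processor's event log and fault management output; for either "bios_hang" result, it is the BIOS update.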


                  • 6. Re: X6450 ESX intermittent lockup
                    Just to be clear for readers, the latest firmware for the X6450 is SW3.2 BIOS27/ILOM3.0.6.13.
                    • 7. Re: X6450 ESX intermittent lockup
                      Hi Anthony,
                      • 8. Re: X6450 ESX intermittent lockup
                        Hi Anthony,
                        • 9. Re: X6450 ESX intermittent lockup
                          Sorry. Thanks for the update. After checking the X6250 BIOS level, we are at 3B23 and the latest version is 3B24. I cannot see the 3B26 you are referring to.

                          I have no logs in the ILOM that suggest any errors, apart from those written after the forced reboot. The latest entry before the reboot is from 4 days earlier.
                          • 10. Re: X6450 ESX intermittent lockup
                            Here is the location of the latest firmware:

                            • 11. Re: X6450 ESX intermittent lockup
                              The link I just posted is for X6450 SW3.2 BIOS 3B27. The latest for the X6250 is 3B24.
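
Since the BIOS levels differ per blade model, a quick check of an installed level against the levels quoted in this thread might look like the Python sketch below. It assumes that the trailing digits of a level string (for example the 23 in 3B23) are a build number that only increases within one blade model; that is an assumption, not something confirmed here. The function names are hypothetical, and the table holds only the values mentioned in this thread as of May 2010.

```python
import re

# Latest BIOS levels as quoted in this thread (May 2010); these will be out of date.
LATEST_BIOS = {"X6450": "3B27", "X6250": "3B24"}


def bios_build(level: str) -> int:
    """Return the trailing build number of a level such as '3B23' or 'BIOS26'."""
    match = re.search(r"(\d+)$", level.strip())
    if match is None:
        raise ValueError(f"unrecognised BIOS level: {level!r}")
    return int(match.group(1))


def is_behind(blade: str, installed: str) -> bool:
    """True if the installed level is older than the level quoted for that blade."""
    return bios_build(installed) < bios_build(LATEST_BIOS[blade])


if __name__ == "__main__":
    print(is_behind("X6250", "3B23"))  # the X6250 level reported in reply 9
```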