46 Replies. Latest reply on Mar 14, 2011 2:15 PM by rukbat
      • 15. Re: Proliant DL380 / Broadcom 5709 net outage
        807559
        Hi All,

        I’ve just spoken with a chap called mui on #opensolaris on irc.freenode.net who reports that this issue relates to “C States”. Disabling “C States” in the BIOS (it’s in “Processor Settings” on Dell boxes) will supposedly work around the issue. C States support was added in Solaris 10 update 8, so this may be why none of our Solaris 10 update 7 boxes were affected by the issue.

        I'm going to give this a go and see if it fixes it.

        Supposedly Sun have a patch available if you have a support contract - but getting Solaris support on your HP/Dell hardware might be somewhat difficult these days :)

        Cheers,

        Alasdair
        • 16. Re: Proliant DL380 / Broadcom 5709 net outage
          807559
          Hi,
          That's great news - I found it hard to find an actual person to contact on the OpenSolaris side (I have a watch on bug 6926051) - the bug was marked as fixed on the 24th but no other info was provided.

          We have HP Solaris support, which means HP have a crack and then proxy the call to Sun on the customer's behalf. The call has been with Sun for about 3-4 days now.

          Last update yesterday was that it has been escalated. I'll make sure they know about your findings - hopefully a fix is on the way.

          Thanks again...
          • 17. Re: Proliant DL380 / Broadcom 5709 net outage
            807559
            While I wait for the official channels to come to a conclusion, I used the technique noted by Alasdair and now have the 6.0.1 driver.

            I will try it out tomorrow on a test box
            • 18. Re: Proliant DL380 / Broadcom 5709 net outage
              807559
              Didn't work, unfortunately; obviously some other dependency is needed:

              [ID 819705 kern.notice] //kernel/drv/amd64/bnx: undefined symbol
              [ID 826211 kern.notice] 'ddi_quiesce_not_supported'
              • 19. Re: Proliant DL380 / Broadcom 5709 net outage
                807559
                Now have driver 5.2.3 running on 2 boxes - running three continuous 1 GB wgets between a patched box and an unpatched box - let's see which one goes first!

                thanks alasdair
                • 20. Re: Proliant DL380 / Broadcom 5709 net outage
                  807559
                  Hi All --

                  I am having this issue as well.

                  Primarily on HP DL360 G6s with the 4.6.2 driver and Ver4060004 firmware, on Solaris 10 update 7.

                  bnx: [ID 517869 kern.info] NOTICE: Broadcom NetXtreme II Gigabit Ethernet Driver v4.6.2
                  bnx: [ID 995108 kern.info] NOTICE: bnx0: BCM5709 device with F/W Ver4060004 is initialized.

                  I have a medium number [20-50] of DL360 G5s and G6s in production, and this issue only started occurring after a recent 10_Recommended patch cluster was applied.


                  So far it has only happened on some of the G6s [at random], but the G5s run the same NIC firmware and driver revision, so we will see. The big difference between the G5s and G6s is the Intel CPU and chipset.


                  No firmware or BIOS or Driver changes were made.

                  TO CONFIRM --

                  This issue ONLY happened AFTER a patch cluster. Firmware and driver were not changed. The patch cluster did not include an updated BNX driver. The BNX driver we use [4.6.2] was obtained via HP.

                  The old kernel, with which there were no problems, was Generic_139556-08

                  The new kernel that we patched to and started to see issues with is Generic_142901-10

                  I strongly suspect it's actually a kernel hook or GLD change that triggers the issue with the old/existing drivers, which is why it was never triggered in the 6-12 months of usage before the patch cluster.

                  Very annoying.

                  Has anyone had any luck with HP or Sun? If so pls PM the ticket ref so I can tack my bug report on.

                  I've been trying to replicate it in production overnight with heavy NFS/SCP/Multicast usage to no avail. Any suggestions on a way to replicate semi-reliably?

                  Thanks very much!

                  Jack.
                  • 21. Re: Proliant DL380 / Broadcom 5709 net outage
                    807559
                    Hi Jack,
                    welcome to the 'fun'

                    I have an HP case open; they are proxying the case to Sun. The last advice received was that ONLY the Sun driver is supported, so I backed out the HP one and installed the Sun driver from the OS CD + latest recommended. There seems to be a lot of confusion as to which driver to use: HP documents say to use theirs, other documents say to just download directly from Broadcom, and Sun say to use theirs (which seems to be a patched 4.6.2 driver).

                    This server failed overnight, so it's back to the drawing board really. Reading between the lines, Sun won't even look at the problem if it's not their driver. Now that I have a failure with their driver, maybe we can get somewhere.

                    As you can see in the thread, I have tried across 8 servers:
                    - two different firmwares
                    - sun + patches and hp drivers
                    - with and without ipmp

                    Interesting theory about U7 and the patch cluster breaking things - I and others have only experienced this issue with U8.

                    Cheers
                    • 22. Re: Proliant DL380 / Broadcom 5709 net outage
                      807559
                      Hi jack,
                      Not sure how to PM you my case number - it's HP case 4615849501

                      Very difficult to reproduce - I ran 3 simultaneous wgets grabbing large files for 5 days to trip it up
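For anyone wanting to repeat that kind of soak test, here is a minimal sketch of running several download loops in parallel. The function names (`fetch_once`, `stress_fetch`) and the worker/round arguments are made up for illustration, and it assumes a POSIX shell such as ksh or bash rather than the legacy Solaris 10 /bin/sh.

```shell
# Hypothetical soak-test helper: N parallel loops, each fetching the same
# large file repeatedly and discarding the data.

# One download; the payload is thrown away, only the NIC traffic matters.
fetch_once() {
    wget -q -O /dev/null "$1"
}

# stress_fetch URL [WORKERS] [ROUNDS]
# WORKERS defaults to 3. ROUNDS=0 (the default) means "loop until interrupted".
stress_fetch() {
    url=$1
    workers=${2:-3}
    rounds=${3:-0}
    w=0
    while [ "$w" -lt "$workers" ]; do
        (
            n=0
            while [ "$rounds" -eq 0 ] || [ "$n" -lt "$rounds" ]; do
                fetch_once "$url"
                n=$((n + 1))
            done
        ) &
        w=$((w + 1))
    done
    wait    # block until every worker loop has finished
}
```

Called as `stress_fetch <url-of-a-large-file> 3`, this would keep three fetch loops running until interrupted. Whether that is enough traffic to trip the fault is anyone's guess; it took 5 days here.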
                      • 23. Re: Proliant DL380 / Broadcom 5709 net outage
                        807559
                        Hi -- Thanks for your info.

                        Have been doing more thinking/testing/reporting about this. Have a case open as well, will let you know any progress.

                        Long story short, we were pretty quick to move from HP DL360 G5's to G6's, and have been running G6's in production with Solaris 10 U7 happily without a single fault since late last year.

                        Big difference between G5 and G6 is the Intel Xeon CPU. Our G5's run E5530's, which do NOT have much advanced power saving functionality. [C0-3 only]

                        The G6's run E5440's which introduced the new "Deep Power Down State" [C6].

                        U7 had no advanced power savings functionality. That was a feature of U8.

                        My current working theory is that when we patched our G6's and moved the kernel from an approx June 09 [U7] release to a May 10 release [Obviously post U8], we got a kernel that did in fact enable this new C6 Deep Power Down State.

                        I strongly suspect [and with the hints from Alasdair and his chat with the opensolaris dev] that the new advanced C-States [c4-6] are in fact an issue and probably break the current BNX driver somehow.

                        This ties in with why we're seeing it across hardware platforms [DL360s, DL380s, Dells[!!]]

                        It also ties in with the difficulty in replicating the issue. I suspect it's something to do with cores being taken wholly offline or shut down when idle.

                        We seem to see it on boxes that are "quiet-ish" but with loads of network traffic. Certainly we would not be using more than 4 cores at the time our failures have occurred. [All boxes are 8 core or above]

                        FYI - We have always disabled, and still do disable, power management in SMA, so it's probably not a direct issue with powerd [not running!]

                        ANYWAY --

                        We have disabled C-state power management on all our G6's and will monitor the situation and see what happens.

                        We had 3 failures in the last few days, if we can last the week with no more, I'm going to file this as "solved for now" and go do some real work.

                        Will keep everyone informed.

                        FYI - To do this on HP BIOS, F9 for BIOS, then

                        Power Management Options -> Advanced Power Management Options -> Minimum Processor Idle Power State -> No C-States

                        On boot we then see;

                        [ID 196753 kern.warning] WARNING: cpupm_init: processor 6: unable to initialize P-state support
                        [ID 763091 kern.warning] WARNING: cpupm_init: processor 6: unable to initialize C-state support

                        FYI x 2

                        kstat -m cpu_info

                        Will report your CPU info, current C-state and Supported Max C-States
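A rough sketch of pulling just the C-state fields out of that output, one line per CPU. It assumes the parseable `kstat -p` format (`module:instance:name:statistic<TAB>value`); the function name `cstate_summary` is made up here.

```shell
# Reads `kstat -p -m cpu_info` output on stdin and prints, per CPU instance,
# the current C-state and the supported maximum, sorted by CPU number.
cstate_summary() {
    awk '
        {
            split($1, key, ":")          # module:instance:name:statistic
            cpu = key[2]; stat = key[4]
            if (stat == "current_cstate")        cur[cpu] = $2
            if (stat == "supported_max_cstates") max[cpu] = $2
        }
        END {
            for (cpu in cur)
                printf "cpu %s: current_cstate=%s supported_max_cstates=%s\n", cpu, cur[cpu], max[cpu]
        }
    ' | sort -k 2n
}
```

Usage would be `kstat -p -m cpu_info | cstate_summary`, ideally captured both in normal operation and while the NIC is dead so the two can be compared.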

                        If anyone has a failure, I'd love to see the kstat output of a failed system. [Lol, can't get it to happen when you want it to happen!!]

                        Good luck,

                        Cheers,
                        Jack.
                        • 24. Re: Proliant DL380 / Broadcom 5709 net outage
                          807559
                          Hi,
                          i had a failure last night on a box with:
                          - cpupm enabled ( ie Solaris doing the power mgmt )
                          - sun bnx + latest patches + specific patches mentioned by sun

                          Sun previously told me that it had nothing to do with C-states nor with the previously mentioned OpenSolaris bug, so I left C-states enabled on this box.

                          I have a full kstat output from when the box was offline if you want it. Here is cpu_info for CPU 0:
                          module: cpu_info instance: 0
                          name: cpu_info0 class: misc
                          brand Intel(r) Xeon(r) CPU X5570 @ 2.93GHz
                          chip_id 0
                          clock_MHz 2933
                          clog_id 0
                          core_id 0
                          cpu_type i386
                          crtime 142.573042836
                          current_clock_Hz 1596000000
                          current_cstate 1
                          family 6
                          fpu_type i387 compatible
                          implementation x86 (chipid 0x0 GenuineIntel family 6 model 26 step 5 clock 2933 MHz)
                          model 26
                          ncore_per_chip 4
                          ncpu_per_chip 8
                          pkg_core_id 0
                          snaptime 176362.333493352
                          state on-line
                          state_begin 1278293689
                          stepping 5
                          supported_frequencies_Hz 1596000000:1729000000:1862000000:1995000000:2128000000:2261000000:2394000000:2527000000:2660000000:2793000000:2926000000:2927000000
                          supported_max_cstates 2
                          vendor_id GenuineIntel

                          Interestingly, all C-states are 1 (while it's dead), which is unusual.

                          Normal operation viewed through powertop shows C0 10%, C1 0.6%, C2 0.0% and C4 90%.
                          With a little script such as:
                          while true ; do
                              pfexec kstat -m cpu_info | grep current_cstate | sort | uniq
                              sleep 1   # pause between samples instead of spinning
                          done

                          i see a lot of '1', the odd '0' and the odd '3'


                          Another odd thing: if you snoop bnx1 (plumbed as an IPMP failover with 0.0.0.0), it receives traffic (ARP and multicast only, as there is no interface address), so it's functioning at some level. Manually plumbing bnx1 doesn't resolve the issue.
                          • 25. Re: Proliant DL380 / Broadcom 5709 net outage
                            807559
                            I think this problem happens on all Broadcom 5709 network cards. My Dell 610 has the same problem, on both Red Hat and Solaris.
                            After replacing the mainboard with a new one, it is OK.
                            • 26. Re: Proliant DL380 / Broadcom 5709 net outage
                              807559
                              Very interesting that you saw all C-states at 1 when it's bust. It's starting to look to me like it's a sleep and driver resume issue, perhaps.

                              Too early to tell, but I haven't had a failure since disabling C-states in the BIOS of all the suspect machines.

                              At least your processor, the X5570, fits with my theory of this only affecting the newer Xeons that support the more advanced "Deep Sleep".

                              Also -- this might be of use, as well as powertop.

                              http://hub.opensolaris.org/bin/view/Project+tesla/Observability

                              There's a DTrace script there [halted.d] which also reports on processor state information.

                              Would be good to knock up a script that uses the idle-state-transition DTrace probe to notify on all state changes, timestamp them, and try to correlate state changes with the ultimate failure of the BNX driver.
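Short of a DTrace script, a cruder polling version of the same idea can be sketched in shell. This is not the idle-state-transition probe, just `kstat` sampled once a second, and the function names (`sample_cstates`, `changes_only`, `watch_cstates`) are made up for illustration.

```shell
# One line per sample: a timestamp followed by every CPU's current C-state.
sample_cstates() {
    states=$(kstat -p -m cpu_info -s current_cstate | awk '{ printf " %s", $2 }')
    echo "$(date '+%Y-%m-%dT%H:%M:%S')$states"
}

# Drops samples whose states (everything after the timestamp) match the
# previous sample, so only changes end up in the log.
changes_only() {
    awk '{
        ts = $1
        $1 = ""                      # leaves " s0 s1 ..." in $0
        if ($0 != prev) { print ts $0; prev = $0 }
    }'
}

# Poll once a second until interrupted.
watch_cstates() {
    while true; do
        sample_cstates
        sleep 1
    done | changes_only
}
```

Running `watch_cstates` on a console or into a log gives a timestamped trail to line up against the moment the NIC drops. Polling this slowly only samples the states and will miss most transitions, which is why the DTrace probe remains the better tool.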

                              If I can be asked I might do that later. Wouldn't mind some help from Sun.com however, anyone fancy giving a hand?

                              FYI as well, I have seen this issue on both IPMP'ed systems [2 x bnx's] and straight single nic bnx machines. I also saw the limited arp's on the bnx0 interface, just some broadcast FF:.....FF:'s.


                              gzbodao -- I can't believe it's a hardware issue, at least on my side. I have had a whole bunch of machines running fine until a recent patch cluster, and then they are all randomly affected. Unless the patch cluster cooked the hardware [which I obviously don't believe to be the case], it has to be an OS issue.


                              Straw poll -- Is anyone having this issue on any CPU that * IS NOT * an Intel Xeon E55xx ??? Are all the Dells that are affected running the E55xx range?

                              Let me know your CPUs, people!!

                              Cheers,
                              Jack.
                              • 27. Re: Proliant DL380 / Broadcom 5709 net outage
                                807559
                                Whoops -- got my CPU Specs around the wrong way.

                                The G5s that do not have the issue run Intel Xeon E5440s.

                                The G6s run Intel Xeon E5530s, which do have the advanced [C6] Deep Sleep state.

                                We only see the issue on the G6s with E55xx class CPUs.
                                • 28. Re: Proliant DL380 / Broadcom 5709 net outage
                                  807559
                                  UPDATE --

                                  After disabling C-States in the BIOS of every G6, we haven't had a single recurrence of this issue in over 10 days.

                                  I'm filing this under "Worked Around For Now"

                                  Despite what Sun and HP seem to be telling everyone, let me tell you that it's C-State related. Turn it off in the BIOS.

                                  As more and more HP customers patch or build their Solaris to U8 and beyond, this issue will steamroll.

                                  I would like to say a big no thank you very much to Oracle/Sun for no help at all, and we as an organisation will be moving to RHEL Linux in an orderly and planned migration at the next large software and hardware update cycle in the coming year[s].

                                  Laters and good luck.
                                  • 29. Re: Proliant DL380 / Broadcom 5709 net outage
                                    807559
                                    We are having the issue on a Dell R610, which has Intel E5540 processors and a Broadcom 5709C.

                                    We are running the same version of osol on a Dell R805 with AMD processors and a Broadcom 5708 without issue.

                                    As a side note, what is the downside to disabling C-states? Does it just use more power? Is there a way to disable only C6?