Proliant DL380 / Broadcom 5709 net outage

807559
    Out of 8 three-month-old servers, I have had 3 outages where network access was completely lost and only a reboot restored it.

    setup
    - Solaris 10, latest, with patches
    - HP BRCMbnx driver v4.6.2
    - onboard Broadcom 5709 in a link-based IPMP config (is this supported?)

    symptoms:
    - switch says both links are up, link lights are flashing
    - dladm show-dev link = up
    - no ping or network access of any kind, in or out
    - physically pulling the cable from bnx0 is not detected and the link does not fail over (i.e. dladm show-dev still shows link up with no cable inserted)
    - reboot is the only way to resolve

    Has anyone else seen any problems like this?
    Any thoughts on how I can diagnose this from the console?
      • 1. Re: Proliant DL380 / Broadcom 5709 net outage
        807559
        Yes, we have also seen the same problem 4 times across 2 servers. As you have seen, a reboot was the only answer.

        HP have advised us to update the firmware and BIOS on the system and install the latest HP driver. We haven't been able to reproduce the issue so we are patching and flashing a system this afternoon and will monitor to see if the problem goes away.

        We have seen it on systems with and without IPMP. We were musing that if we used probe-based IPMP we would at least detect the failure and fail over to the other network interface: we have manually failed broken bnx interfaces in IPMP pairs and the connection comes back. We haven't yet tested real probe-based IPMP, though.

        Driver is at:
        http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=3884082&swItem=MTX-95ceaca8bf194339b5c83e5bc6&prodNameId=3884083&swEnvOID=2023&swLang=13&taskId=135&mode=4&idx=1

        And we ended up using the latest firmware and BIOS from:
        http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=3949986&prodNameId=3949988&swEnvOID=1113&swLang=13&mode=2&taskId=135&swItem=MTX-a69a789f6bfc4b0fbe2d84c7c6


        The diagnosis I tried from the console (snoops, unplumbing and replumbing interfaces, etc.) all failed to either tell me anything or fix the problem.
        • 2. Re: Proliant DL380 / Broadcom 5709 net outage
          807559
          Hi James,
          thanks for responding.
          I updated firmware and drivers 4 days ago and unfortunately have had the same symptom again on 1 box so far.

          Firmware is now ver5020002 and driver is now 5.2.2

          Identical symptoms to before - can't ping, can't snoop, reboot the only solution.

          I have an e1000 on some boxes (unused). Do you know if I can use link-based IPMP across adapter types, i.e. bnx0 and e1000g0?

          Very frustrating. Back to HP.
          HP have also just informed us that they will no longer support Solaris on HP after 1 July. They will honor existing contracts.

          Edited by: dhermans on 8/06/2010 10:53
          • 3. Re: Proliant DL380 / Broadcom 5709 net outage
            807559
            The system I updated last is still up, so I'm disappointed that you have seen the problem again; I guess we will hit it again eventually and I too will have to escalate within HP to get some traction on the issue. If I do get any further useful information I'll post an update here.

            I'm informed by colleagues that they have done IPMP across nge and e1000g successfully in the past, so I imagine it should work fine. Of course, if you're doing link-based IPMP it's not going to work, because when the bnx card fails it stays "linked up"; probe-based IPMP would still be needed to fail over from a failed bnx to a working e1000g card.
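
The difference between the two detection modes can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not how Solaris in.mpathd is implemented: the class `InterfaceHealth`, its method names, and the threshold of 5 missed probes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InterfaceHealth:
    """Tracks one NIC; all names and the threshold are illustrative assumptions."""
    name: str
    fail_threshold: int = 5  # hypothetical: consecutive missed probes before failing
    missed: int = 0

    def record_probe(self, reply_received: bool) -> None:
        # A reply resets the counter; a missed reply increments it.
        self.missed = 0 if reply_received else self.missed + 1

    def failed_link_based(self, link_up: bool) -> bool:
        # Link-based detection trusts only the link state the driver reports.
        return not link_up

    def failed_probe_based(self, link_up: bool) -> bool:
        # Probe-based detection also fails the NIC when probes go unanswered.
        return (not link_up) or self.missed >= self.fail_threshold


# The hang described in this thread: driver reports link up, no traffic passes.
bnx0 = InterfaceHealth("bnx0")
for _ in range(5):
    bnx0.record_probe(reply_received=False)

print(bnx0.failed_link_based(link_up=True))   # False: the failure goes unnoticed
print(bnx0.failed_probe_based(link_up=True))  # True: a failover would be triggered
```

With a stuck "link up" state, only the probe-based check ever reports the interface as failed, which is why mixing bnx and e1000g in a link-based group would not help here.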

            I was aware of the Solaris support on HP situation. Some people within my company are due to meet with HP again later this month to further discuss the future of the Solaris OEM agreement HP currently have with Oracle.
            • 4. Re: Proliant DL380 / Broadcom 5709 net outage
              807559
              I logged the issue with Broadcom. They quickly replied that it has nothing to do with MSI-X (a lot of people on Red Hat are getting disconnects on this card; see https://bugzilla.redhat.com/show_bug.cgi?id=520888)

              and that I should run:
              kstat -m bnx

              and send them the output...
              • 5. Re: Proliant DL380 / Broadcom 5709 net outage
                807559
                Thanks I had seen the RedHat article, but only read the bug when you pointed it out to me. It's useful as background but nice to have it confirmed as not our issue.

                I'm having problems triggering the issue to help the support people I am working with at HP reproduce it (it always happens when we aren't looking). Do you have any idea whether it is heavy load or a particular set of conditions that causes the issue for you?

                Edited by: jameslegg on Jun 10, 2010 9:40 AM
                • 6. Re: Proliant DL380 / Broadcom 5709 net outage
                  807559
                  Hi,

                  Got a failure last night and captured the kstat output for Broadcom. I can send it to you if you're interested.

                  As to your question: the system was completely idle and failed overnight, so I have no idea what triggers it.

                  I thought it was load (each box has 4 zones), but for a server of this power the NIC is NOT doing any real work. I've seen failures mostly during the day, so I thought it was a particular type of traffic. I have 8 servers and only two have had multiple failures each, and these two aren't the busy ones.

                  Very frustrating. We go live in a month, so I may be using e1000 if I don't get a good answer very quickly.
                  • 7. Re: Proliant DL380 / Broadcom 5709 net outage
                    807559
                    Hello,

                    Yes please, I would be interested in the kstat output. On the next failure I'd like to compare and see whether any differences or similarities appear; if I spot anything I will share.

                    Do you have a case open with HP as well as Broadcom about this? We have our call escalated with HP at the moment, and if you're experiencing the same problem we should confirm that they are aware of both calls.

                    It's interesting that you see failures when idle; we have also seen failures during quiet times. Most of our boxes are pre-live, so they are generally built on the network but applications have not been installed and configured yet. We also have between 2 and 4 zones on the systems, but again they are not especially busy at the moment.

                    We are also having to consider different network cards and e1000g does seem the lesser of the network card evils at the moment. Our only other option is entirely different systems.
                    • 8. Re: Proliant DL380 / Broadcom 5709 net outage
                      807559
                      Hi,

                      I have a Broadcom case, 322723, and have sent them the kstat output below. No response for 2 days. They actually immediately closed the case as resolved, which is a bit disappointing.

                      My HP case is 4615849501; I really only just logged it due to contract issues.

                      Also, I have yet to see this on 4 boxes without zones, but these boxes are also pretty much idle.

                      We are weeks away from production, so e1000 is looking like the way to go for us.

                      kstat -m bnx (while offline). Sorry, quite long; I had to truncate bnx1 and below. Let me know an alternative method and I'll send the rest.

                      module: bnx instance: 0
                      name: bnx0 class: net
                           brdcstrcv 80885
                           brdcstxmt 44133
                           collisions 0
                           crtime 152.483290789
                           ierrors 0
                           ifspeed 1000000000
                           ipackets 161447675
                           ipackets64 161447675
                           multircv 0
                           multixmt 0
                           norcvbuf 0
                           noxmtbuf 0
                           obytes 3077603625
                           obytes64 106156818729
                           oerrors 0
                           opackets 185412859
                           opackets64 185412859
                           rbytes 327161841
                           rbytes64 64751671281
                           snaptime 851061.110986364
                           unknowns 0

                      module: bnx instance: 0
                      name: fm class: misc
                           acc_err 0
                           crtime 152.367207809
                           dma_err 0
                           erpt_dropped 0
                           fm_cache_full 0
                           fm_cache_miss 0
                           snaptime 851061.111403506

                      module: bnx instance: 0
                      name: mac class: net
                           adv_cap_1000fdx 1
                           adv_cap_1000hdx 1
                           adv_cap_100fdx 1
                           adv_cap_100hdx 1
                           adv_cap_10fdx 1
                           adv_cap_10hdx 1
                           adv_cap_asmpause 1
                           adv_cap_autoneg 1
                           adv_cap_pause 1
                           align_errors 0
                           brdcstrcv 80885
                           brdcstxmt 44133
                           cap_1000fdx 1
                           cap_1000hdx 1
                           cap_100fdx 1
                           cap_100hdx 1
                           cap_10fdx 1
                           cap_10hdx 1
                           cap_asmpause 1
                           cap_autoneg 1
                           cap_pause 1
                           carrier_errors 0
                           collisions 0
                           crtime 152.366622095
                           defer_xmts 0
                           ex_collsions 0
                           fcs_errors 0
                           first_collsions 0
                           ierrors 0
                           ifspeed 1000000000
                           ipackets 161447675
                           ipackets64 161447675
                           link_asmpause 0
                           link_autoneg 1
                           link_duplex 2
                           link_pause 0
                           link_state 1
                           link_up 1
                           lp_cap_1000fdx 1
                           lp_cap_1000hdx 1
                           lp_cap_100fdx 1
                           lp_cap_100hdx 1
                           lp_cap_10fdx 1
                           lp_cap_10hdx 1
                           lp_cap_asmpause 0
                           lp_cap_autoneg 1
                           lp_cap_pause 0
                           macrcv_errors 0
                           macxmt_errors 0
                           multi_collsions 0
                           multircv 0
                           multixmt 0
                           norcvbuf 0
                           noxmtbuf 0
                           obytes 3077603625
                           obytes64 106156818729
                           oerrors 0
                           opackets 185412859
                           opackets64 185412859
                           promisc 0
                           rbytes 327161841
                           rbytes64 64751671281
                           snaptime 851061.111552127
                           sqe_errors 0
                           toolong_errors 0
                           tx_late_collsions 0
                           unknowns 0
                           xcvr_addr 1
                           xcvr_id 21217224
                           xcvr_inuse 7

                      module: bnx instance: 1
                      name: bnx1 class: net
                           brdcstrcv 125014
                           brdcstxmt 0
                           collisions 0
                           crtime 152.640676498
                           ierrors 0
                           ifspeed 1000000000
                           ipackets 125014
                           ipackets64 125014
                           multircv 0
                           multixmt 0
                           norcvbuf 0
                           noxmtbuf 0
                           obytes 0
                           obytes64 0
                           oerrors 0
                           opackets 0
                           opackets64 0
                           rbytes 8099008
                           rbytes64 8099008
                           snaptime 851061.112658052
                           unknowns 124816
                      • 9. Re: Proliant DL380 / Broadcom 5709 net outage
                        807559
                        We also just saw the problem again on a fully patched system (both firmware and latest driver). It took 11 days to occur for us, and I gathered as much information as I could from the system for HP. I've taken the liberty of passing your case reference on to the people working our escalation to try to increase the visibility of the issue. Unfortunately the system I was examining stopped responding to input midway through the data collection process, so I never got kstats to compare against yours.

                        In my own Google searches I found the following blog entry, which links to a Solaris bug that appears on the face of it to be the same issue:

                        http://blogs.everycity.co.uk/alasdair/2010/06/broadcom-nics-dropping-out-on-solaris-10/

                        http://bugs.opensolaris.org/bugdatabase/view_bug.do;jsessionid=c5a57cf14be1ff0bba06d2781344?bug_id=6926051

                        I am going to see if the reproduction details (using NFS heavily) in the bug allow me to reproduce in a test environment.
                        • 10. Re: Proliant DL380 / Broadcom 5709 net outage
                          807559
                          Hi James,

                          Great work finding the OpenSolaris links. I searched extensively without finding anything.

                          Have you seen it on the older firmware? I definitely have.

                          I'll try some load testing as well. What NFS load generator can you suggest?

                          I have finally at least sent HP some data; we'll see what transpires.
                          • 11. Re: Proliant DL380 / Broadcom 5709 net outage
                            807559
                            I was actually searching for the best way to confirm the firmware version of the nic when I stumbled upon the blog entry, and it was only written on Monday.

                            I have seen a failure on the older version of the firmware (Ver4060004) and the older version of the driver (v4.6.2). As yet I haven't seen a failure with a mixture, but with it taking 11 days for a failure to happen that doesn't mean much at all.

                            I've never used an NFS load generator. I was going to start with some simple large file copies and get more complicated from there if I need to.

                            As always if anything interesting happens I'll let you know.
                            • 12. Re: Proliant DL380 / Broadcom 5709 net outage
                              807559
                              I managed to reproduce the issue last night by copying lots of 3GB files across an NFS share in a loop (left it running overnight). http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6938878 mentions a newer version (6.0.1) of a not-yet-released Broadcom driver that was integrated into OpenSolaris build 143. I am hoping our vendor will be able to provide either that driver or an IDR fix for http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6926051 that will work with S10u8.
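
A copy loop like the one described can be scripted. Below is a minimal Python sketch, assuming the NFS share is already mounted; the function name `copy_in_loop` and the paths in the usage comment are hypothetical, and there is no guarantee that this load triggers the hang.

```python
import shutil
from pathlib import Path


def copy_in_loop(sources, dest_dir, passes):
    """Copy each source file into dest_dir, repeating `passes` times.

    Returns the total number of bytes copied. Each pass overwrites the
    files written by the previous pass.
    """
    dest = Path(dest_dir)
    total = 0
    for _ in range(passes):
        for src in map(Path, sources):
            shutil.copyfile(src, dest / src.name)
            total += src.stat().st_size
    return total


# Hypothetical usage: the paths below are placeholders, not from the thread.
# copy_in_loop(["/var/tmp/big1.dat", "/var/tmp/big2.dat"], "/mnt/nfs_share", passes=1000)
```

Overwriting the same destination files on each pass keeps disk usage on the share constant while still pushing the full data volume over the wire every time.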

                              Edited by: jameslegg on Jun 17, 2010 12:16 PM
                              • 13. Re: Proliant DL380 / Broadcom 5709 net outage
                                807559
                                I had another failure over the weekend on a basically idle host, so 3 out of 8 have now dropped out.

                                HP recommended the latest recommended patches and kernel, which is a change management nightmare. I was running recommended from Apr02 but have installed recommended Jun02 + 142901-13 on one box.

                                Pounded the box overnight by doing a continual wget of a 1 GB file, but nothing happened. I think I need a faster webserver, as it killed my Sun Fire V210.

                                The wheel keeps turning. No update from HP for 3 days.
                                • 14. Re: Proliant DL380 / Broadcom 5709 net outage
                                  807559
                                  So far I have seen failures take anywhere from 24 hours to 12 days to occur. My heavy NFS test traffic hasn't yet provoked it more than once. Currently I'm soak testing an e1000g card (the HP NC364T) as an alternative option if nobody can provide us a fix, but I am still pushing for a fix.

                                  -James