Yes we have also seen the same problem 4 times across 2 servers. And as you have seen reboot was the only answer.
HP have advised us to update the firmware and BIOS on the system and install the latest HP driver. We haven't been able to reproduce the issue so we are patching and flashing a system this afternoon and will monitor to see if the problem goes away.
We have seen it on systems with and without IPMP. We were musing that if we used probe-based IPMP we would at least detect the failure and fail over to the other network interface; we have manually failed broken bnx interfaces in IPMP pairs and the connection comes back. We haven't yet tested real probe-based IPMP, though.
Driver is at:
And we ended up using the latest firmware and BIOS from:
The diagnosis I tried from the console (snoop, unplumbing and replumbing interfaces, etc.) all failed to either tell me anything or fix the problem.
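For anyone hitting the same thing, the console diagnosis described above looks roughly like the following (bnx0 and the addresses are examples, not from a real box):

```shell
# Typical console diagnostics for a dead interface (bnx0 as an example).
snoop -d bnx0                    # watch for any inbound/outbound packets
ifconfig bnx0 unplumb            # tear the interface down completely
ifconfig bnx0 plumb              # and bring it back up
ifconfig bnx0 inet 192.0.2.10 netmask 255.255.255.0 up
ping 192.0.2.1                   # test a known-good host on the subnet
```

In this failure mode snoop shows nothing at all and the replumb makes no difference, which is what points at the card/driver rather than the IP stack.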
thanks for responding.
I updated firmware and drivers 4 days ago and unfortunately have had the same symptom again on 1 box so far.
Firmware is now ver5020002 and driver is now 5.2.2
Identical symptoms to before - can't ping, can't snoop, reboot the only solution.
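For reference, the loaded driver revision can be confirmed with modinfo; whether the firmware revision also shows up as a kstat statistic depends on the driver build, so treat that second line as an assumption:

```shell
# Confirm the loaded bnx driver revision (grep pattern is illustrative).
modinfo | grep -i bnx
# Some bnx driver builds expose version/firmware details as statistics;
# this may or may not be present on a given release.
kstat -m bnx -i 0 | grep -i ver
```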
I have an e1000 on some boxes (unused) - do you know if I can use link-based IPMP across adapter types, i.e. bnx0 and e1000g0?
very frustrating - back to HP
HP have also just informed us that they are no longer supporting Solaris on HP after 1 July. They will honor existing contracts.
Edited by: dhermans on 8/06/2010 10:53
The system I updated last is still up, so I'm disappointed that you have seen the problem again, as I guess we will hit it again eventually and I too will have to escalate within HP to get some traction on the issue. If I do get any further useful information I'll post an update here.
I'm informed by colleagues that they have done IPMP across nge and e1000g successfully in the past, so I imagine it should work fine. Of course if you're doing link-based it's not going to work, because when the bnx card fails it stays "linked up", so probe-based IPMP would still be needed to fail from a failed bnx to a working e1000g card.
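A probe-based setup across the two adapter types could be sketched like this on Solaris 10 (untested here; the group name and all addresses are invented). The non-failover "test" addresses let in.mpathd probe the network instead of trusting link state, which the failed bnx cards misreport as up:

```shell
# Sketch: probe-based IPMP group spanning bnx0 and e1000g0.
# Data address stays on bnx0; each interface gets a test address
# marked -failover deprecated so in.mpathd can probe through it.
ifconfig bnx0 group prod
ifconfig bnx0 addif 192.0.2.11 netmask 255.255.255.0 -failover deprecated up
ifconfig e1000g0 plumb
ifconfig e1000g0 192.0.2.12 netmask 255.255.255.0 group prod \
    -failover deprecated standby up
```

The same settings would go in /etc/hostname.bnx0 and /etc/hostname.e1000g0 to survive a reboot.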
I was aware of the Solaris support on HP situation. Some people within my company are due to meet with HP again later this month to further discuss the future of the Solaris OEM agreement HP currently have with Oracle.
I logged the issue with Broadcom; they quickly replied that it has nothing to do with MSI-X (a lot of people on Red Hat are getting disconnects on this card - refer to https://bugzilla.redhat.com/show_bug.cgi?id=520888)
and that i should run:
kstat -m bnx
and send them the output...
Thanks I had seen the RedHat article, but only read the bug when you pointed it out to me. It's useful as background but nice to have it confirmed as not our issue.
I'm having problems triggering the issue to help the support people I am working with at HP to reproduce it (it always happens when we aren't looking). Do you have any idea if it is heavy load, or a particular set of conditions, that causes the issues for you?
Edited by: jameslegg on Jun 10, 2010 9:40 AM
Got a failure last night and captured the kstat output for Broadcom - I can send it to you if you're interested.
As to your question - the system was completely idle and failed overnight, so I have no idea what triggers it.
I thought it was load (each box has 4 zones), but for a server of this power the NIC is NOT doing any real work. I've seen failures mostly during the day, so I thought it was a particular type of traffic. I have 8 servers and only two have had multiple failures each, and these two aren't the busy ones.
Very frustrating - we're going live in a month, so I may be using e1000 if I don't get a good answer very quickly.
Yes please, I would be interested in the kstat output; on the next failure I'd like to compare and see if any differences/similarities appear. If I spot anything I will share it.
Do you have a case open with HP as well as Broadcom about this? We have our call escalated with HP at the moment, and if you're experiencing the same problem we should confirm that they are aware of both calls.
It's interesting that you see failures when idle; we have also seen failures during quiet times. Most of our boxes are pre-live, so they are generally built on the network but applications have not been installed and configured yet. We also have between 2 and 4 zones on the systems, but again not especially busy at the moment.
We are also having to consider different network cards and e1000g does seem the lesser of the network card evils at the moment. Our only other option is entirely different systems.
I have a Broadcom case, 322723, and have sent them the kstat output below. No response for 2 days; they actually immediately closed the case as resolved, which is a bit disappointing.
My HP case is 4615849501, and I really only just logged it due to the contract issues.
Also, I am yet to see this on 4 boxes without zones but these boxes are also pretty much idle.
We are weeks away from prod'n, so e1000 is looking like the way to go for us.
kstat -m bnx (while offline) - sorry, it's quite long; I had to truncate from bnx1 down. Let me know an alternative method and I'll send the rest.
module: bnx instance: 0
name: bnx0 class: net
module: bnx instance: 0
name: fm class: misc
module: bnx instance: 0
name: mac class: net
module: bnx instance: 1
name: bnx1 class: net
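On the "alternative method" question: kstat's parseable mode plus compression keeps the full statistics small enough to attach without truncating anything (the output path is just an example):

```shell
# Capture the complete bnx statistics in kstat's parseable one-line-per-
# statistic format, then compress the result for attaching to a case.
kstat -p -m bnx > /var/tmp/bnx-kstat.out
gzip /var/tmp/bnx-kstat.out          # produces /var/tmp/bnx-kstat.out.gz
```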
We also just saw the problem again on a fully patched system (both firmware and latest driver). It took 11 days to occur for us, and I gathered as much information as I could from the system for HP. I've taken the liberty of passing your case reference on to the people working our escalation to try and increase the visibility of the issue. Unfortunately the system I was examining stopped responding to input midway through the data collection process, so I never got kstats to compare against yours.
In my own Google searches I found the following blog entry that links to a Solaris bug, which appears on the face of it to be our issue.
I am going to see if the reproduction details (using NFS heavily) in the bug allow me to reproduce in a test environment.
Great work finding the OpenSolaris links - I searched extensively without finding anything.
Have you seen it on the older firmware? I definitely have.
I'll try some load testing as well - what NFS load generator can you suggest?
I have finally at least sent HP some data; we'll see what transpires.
I was actually searching for the best way to confirm the firmware version of the nic when I stumbled upon the blog entry, and it was only written on Monday.
I have seen a failure on the older version of the firmware (Ver4060004) and the older version of the driver (v4.6.2), as of yet I haven't seen a failure with a mixture, but with it taking 11 days for a failure to happen that doesn't mean much at all.
I've never used an NFS load generator, I was going to start with some simple large file copies and get more complicated from then on if I need to.
As always if anything interesting happens I'll let you know.
I managed to reproduce the issue last night by copying lots of 3GB files across an NFS share in a loop (left it running overnight). http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6938878 mentions a higher version (6.0.1) of an unreleased Broadcom driver that integrated into OpenSolaris build 143, which I am hoping our vendor will be able to provide; either that or an IDR fix for http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6926051 that will work with S10u8.
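A minimal version of that overnight reproduction loop might look like this (the mount point and file paths are examples; the source file could be created with mkfile):

```shell
# Hypothetical reproduction loop: repeatedly copy a large file across an
# NFS mount to drive sustained traffic through the bnx interface.
# Source file made beforehand with e.g.:  mkfile 3g /var/tmp/big3g.dat
SRC=/var/tmp/big3g.dat
DST=/mnt/nfstest/big3g.copy
while true; do
    cp "$SRC" "$DST" || break    # the loop stops when the network drops
    rm -f "$DST"
done
```

When the interface dies the cp hangs or fails, which timestamps the failure nicely in the shell history.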
Edited by: jameslegg on Jun 17, 2010 12:16 PM
I had another failure over the weekend on a basically idle host, so 3 out of 8 have now dropped out.
HP recommended the latest Recommended patches and kernel, which is a change management nightmare. I was running the Apr02 Recommended cluster but have installed the Jun02 Recommended cluster + 142901-13 on one box.
I pounded the box overnight with a continual wget of a 1GB file, but nothing happened - I think I need a faster webserver, as it killed my Sun Fire V210.
The wheel keeps turning - no update from HP for 3 days.
So far I have seen failures take anywhere from 24 hours to 12 days to occur. My heavy NFS test traffic hasn't yet provoked it more than once. Currently I'm soak testing an e1000g card (the HP NC364T) as an alternative option in case nobody can provide us a fix, but I am still pushing for a fix.