Hi. I have some odd IPMP behaviour I can't explain - I wonder if anyone has run into similar issues or can otherwise shed light?
I have a shiny new Netra x4250 running Sol10 10/09. What I'm trying to configure is as follows:
e1000g0 as a "management" interface
e1000g1 and e1000g2 as an active/standby failover ipmp group with probe based failure detection.
The server doesn't have a default router, but it does have five probe targets on the same LAN and in the same subnet, configured as static routes via a startup script.
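For reference, the startup script just adds on-link host routes to the five probe targets, along these lines (a sketch; the target addresses are the ones visible in the netstat output further down, and in.mpathd will use on-link host routes as probe targets when there is no default router):

```shell
#!/sbin/sh
# Sketch: add host routes for the five IPMP probe targets.
# Gateway == target itself, which produces the UGH entries
# seen in netstat -rn. Run once at boot, e.g. from /etc/rc2.d.
for target in 192.168.1.3 192.168.1.4 192.168.1.5 192.168.1.6 192.168.1.7
do
    route add -host $target $target
done
```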
The problem is that when the system boots, the mpathd process marks the standby interface as FAILED with the message:
Sep 9 17:48:45 server1 in.mpathd: NIC failure detected on e1000g2 of group frontend
An ifconfig -a at this point looks like:
[root@server1]/root #ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.10.10.51 netmask ffffff00 broadcast 10.10.10.255
e1000g1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
e1000g1:1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
inet 192.168.1.100 netmask ffffff00 broadcast 192.168.1.255
e1000g2: flags=39040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED,STANDBY> mtu 1500 index 4
inet 192.168.1.101 netmask ffffff00 broadcast 192.168.1.255
...and a netstat -rn shows:
[root@server1]/root #netstat -rn
Routing Table: IPv4
Destination Gateway Flags Ref Use Interface
-------------------- -------------------- ----- ----- ---------- ---------
10.10.10.0 10.10.10.51 U 1 2 e1000g0
192.168.1.0 192.168.1.2 U 1 10 e1000g1
192.168.1.0 192.168.1.2 U 1 0 e1000g1:1
192.168.1.0 192.168.1.2 U 1 15 e1000g2
192.168.1.3 192.168.1.3 UGH 1 0
192.168.1.4 192.168.1.4 UGH 1 0
192.168.1.5 192.168.1.5 UGH 1 0
192.168.1.6 192.168.1.6 UGH 1 0
192.168.1.7 192.168.1.7 UGH 1 0
127.0.0.1 127.0.0.1 UH 1 0 lo0
If I try detaching the standby interface and re-attaching it using if_mpadm -d and if_mpadm -r, in.mpathd reports the same NIC failure immediately.
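For completeness, that detach/reattach is just (with the interface named explicitly):

```shell
# Take the standby interface offline within its IPMP group,
# then bring it back online.
if_mpadm -d e1000g2    # detach (offline) e1000g2
if_mpadm -r e1000g2    # reattach (undo the offline)
```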
If I unplumb and reconfigure the standby interface (e1000g2) and then pkill -HUP in.mpathd, the ifconfig output for the standby interface becomes:
e1000g2: flags=69040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 1500 index 5
inet 192.168.1.101 netmask ffffff00 broadcast 192.168.1.255
...which looks a bit healthier, but a snoop shows zero traffic on that interface, and if I pull the active interface's patch lead, it doesn't fail over.
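The unplumb/reconfigure step above looks roughly like this (a sketch; the flags mirror the hostname.e1000g2 config shown below):

```shell
# Unplumb the failed standby interface, plumb it again with the
# test address and the IPMP flags, then signal in.mpathd to
# re-read interface state.
ifconfig e1000g2 unplumb
ifconfig e1000g2 plumb 192.168.1.101 netmask + broadcast + \
    deprecated group frontend -failover standby up
pkill -HUP in.mpathd
```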
As a sanity check, if I unplumb both interfaces and configure e1000g2 (i.e. the one that is failing) as a normal, non-ipmp interface, using the test IP 192.168.1.101, I can ping all five probe target IP's fine.
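That sanity check, roughly:

```shell
# Take both group members down, then plumb e1000g2 standalone
# (no IPMP group) on the test address and ping each probe target.
ifconfig e1000g1 unplumb
ifconfig e1000g2 unplumb
ifconfig e1000g2 plumb 192.168.1.101 netmask 255.255.255.0 up
for target in 192.168.1.3 192.168.1.4 192.168.1.5 192.168.1.6 192.168.1.7
do
    ping $target 2    # Solaris ping with a 2-second timeout
done
```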
To complete the picture, here are the other relevant bits of config:
/etc/default/mpathd is unmodified from the defaults (comment lines from the stock file shown for reference):

#pragma ident "@(#)mpathd.dfl 1.2 00/07/17 SMI"
# Time taken by mpathd to detect a NIC failure in ms. The minimum time
# that can be specified is 100 ms.
# Failback is enabled by default. To disable failback turn off this option
# By default only interfaces configured as part of multipathing groups
# are tracked. Turn off this option to track all network interfaces
# on the system

/etc/hostname.e1000g1:

netmask + broadcast + group frontend up \
addif 192.168.1.100 netmask + broadcast + deprecated -failover up

/etc/hostname.e1000g2:

192.168.1.101 netmask + broadcast + deprecated group frontend -failover standby up
Any help or thoughts much appreciated.
After some more digging it gets weirder...
If I snoop on the active interface (e1000g1), I can see pings going out to the ipmp probe targets as you would expect.
I also see pings going to (and coming back from) the probe targets from the test IP of the failed interface (192.168.1.101)! Furthermore (and possibly even weirder), those pings have the source MAC address of the failed interface too.
Is it possible that snooping on one interface (i.e. snoop -d e1000g1) could pick up ethernet frames from a different interface (i.e. e1000g2)? It seems pretty unlikely to me - it rather defeats the object of specifying an interface to snoop on. If I snoop -d e1000g2 (the failed interface) I get nothing.
So if those pings relating to the standby test address are going out onto the LAN from the active interface with the source MAC of the standby interface, the LAN switch is going to learn that that MAC belongs to the active interface's port. The ping replies will therefore be received on the active interface (which is what I see in the snoop). So how on earth is in.mpathd going to know that the standby NIC is working and can be brought back into service?
We are having the same issue here. We have two interfaces e1000g1 and e1000g2 in the same ipmp group with two virtual interfaces running on the active interface. On one of our x4250 systems this problem is now not reproducible. On the other the e1000g2 interface is FAILED almost immediately after being added to the group.
If you or anyone else has information on this issue please post it here, this is causing us grief.
If you can always reproduce the problem on one and never on the other, the obvious question is what is different between your two servers?
I'm making some headway - Oracle support have pointed out the below to me:
...which outlines a known IPMP bug that my symptoms fit and is fixed in a kernel patch.
I'm in the process of applying patches, will post an update when done.
On the system that works I've got 141445-09, which your reference states may cause this issue, but I've also already got 142901-10, which the same article states should fix it.
On the system that doesn't, I only have 141445-09... I'll be adding the fix to the affected system today.
We were facing this problem and the metalink note 1021262.1 helped. Our x86 system running Solaris 10 u8 did have the 141445-09 patch but did NOT have the 142901 patch. After applying patch 142901-02, the problem was resolved.
... and this resurrected ancient post is now locked.
It was originally a discussion on the old Sun forum web site.
The 2010 posting dates represent when it was migrated to the Oracle forum site.
None of the original posters ever registered on the OTN forums, so there are no individual poster usernames, and none of them will know this latest post exists.