Your switching guys should allow you read-only access to the switch the hosts are connected to. You should at least be able to confirm that the MAC table is not being constantly rewritten and that the MAC address of the host in question is still tied to the same inbound and outbound ports of the adjacent switches.
The tcpdump showing no responses to the pings may indicate that the route to the host being pinged is not known, or is at least "stale". This still doesn't rule out a switching issue. The default gateway is probably not on the same switch as the host having the issue, meaning the host still needs to know the layer 2 (MAC) path to the default gateway's layer 3 IP address. If the MAC table is constantly changing on the switch, this will affect your route to the default gateway. This is why I think you've got a switch problem. Get the MAC address of your default gateway by using arp. Then list the MAC table on the switch and find out which port is being used to forward frames to that MAC. Check every few minutes to see whether the destination port has changed on the switch.
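To make that check a bit less manual, here is a minimal Python sketch. It assumes you paste in the text of `arp -n` from the host and that you note down by hand which switch port the gateway's MAC was learned on at each check. The function names, the sample ARP text, and the port names are all made up for illustration; the switch side is vendor-specific, so it is left as plain (time, port) notes.

```python
import re

MAC_RE = r"(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}"

def mac_for_ip(arp_n_output, ip):
    """Return the MAC listed for `ip` in `arp -n` text, or None if absent/incomplete."""
    for line in arp_n_output.splitlines():
        fields = line.split()
        if fields and fields[0] == ip:
            for field in fields[1:]:
                if re.fullmatch(MAC_RE, field):
                    return field.lower()
    return None

def port_changes(observations):
    """Given [(time, switch_port), ...] in order, list every point where the port moved."""
    changes = []
    for (_, prev_port), (cur_time, cur_port) in zip(observations, observations[1:]):
        if cur_port != prev_port:
            changes.append((cur_time, prev_port, cur_port))
    return changes

# Hypothetical sample data, not taken from any real host or switch.
SAMPLE_ARP = """\
Address                  HWtype  HWaddress           Flags Mask            Iface
10.0.0.1                 ether   00:11:22:33:44:55   C                     eth0
10.0.0.9                         (incomplete)                              eth0
"""

if __name__ == "__main__":
    print(mac_for_ip(SAMPLE_ARP, "10.0.0.1"))
    print(port_changes([("10:00", "Gi0/1"), ("10:05", "Gi0/1"), ("10:10", "Gi0/7")]))
```

If `port_changes` ever returns a non-empty list for the gateway's MAC while nothing was re-cabled, that would support the switching theory.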
Also, I can't remember if you have tried this or not, but if you have a host on the same switch and it can ping the host with no problem (because it's not jumping switches or being forwarded to the default route), then this would indicate a switch forwarding problem.
Don't trust your network guys. No offense.... I'm a former network administrator.
I fully understand the phrase "Don't trust your network guys". You won't get any arguments from me on that.
I'm going to read over your response in detail, and assume you know much more about network than I do, and formulate a plan.
I want to remind you, though, of some detailed symptoms:
I've got about 100 servers, physical and virtual. ALL of the crazy symptoms I'm seeing are on Oracle Linux and Oracle VM server. It may be that I interact with those more than the other servers, and would see the symptom mostly there. But surely SOMEONE would be seeing the symptom elsewhere. I easily imagine that we have a switching problem, since the networking guys are working on a "DMZ redesign plan" right now. But from what I see, only Oracle Linux is affected by it.
While the OEL host is ignoring the incoming pings from my desktop, I can ping that same desktop from OEL, and responses come back. That tells me that the OEL box CAN communicate with my desktop.
At the same time that I can ping my desktop (and it cannot ping OEL), I can traceroute to my desktop, and the response is "network down". What the heck could cause that? If you're a network programming guy, read on: A syscall trace of the traceroute shows it getting an error ENETDOWN to its one and only sendto() call. From what I know, and from documentation, ENETDOWN means "the network interface you're trying to use is down". But it clearly is NOT down. Crazy.
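For anyone who wants to see the errno without running a full syscall trace, here is a small Python sketch. It sends a single empty UDP datagram, which is roughly what a classic UDP traceroute probe does, and reports the symbolic errno if the kernel refuses it. The function names and the port default (33434, the traditional traceroute base port) are my own choices for illustration, and this is not a reimplementation of traceroute.

```python
import errno
import os
import socket

def describe_errno(code):
    """Map an errno number to (symbolic name, human-readable message)."""
    if code is None:
        return ("UNKNOWN", "no errno set")
    return (errno.errorcode.get(code, "UNKNOWN"), os.strerror(code))

def try_udp_send(dest_ip, port=33434):
    """Send one empty UDP datagram; return None on success or describe_errno(...) on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(b"", (dest_ip, port))
        return None
    except OSError as exc:
        return describe_errno(exc.errno)
    finally:
        sock.close()

if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    print(try_udp_send(target))
```

Running it against the desktop's address while the symptom is active would show whether plain `sendto()` calls fail the same way the traceroute did.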
Okay, I'm going to go read your reply in detail now. Thanks for your help.
To be honest, seeing the word DMZ really makes me think that there is a firewall in between you and the hosts. A true DMZ will always have a firewall. Even if the network guys are using the switches to truly segment traffic, they are emulating a firewall environment.
It does sound odd that the host can ping your desktop while at the same time refusing to respond to pings. In a firewalled/segmented environment things like this happen all the time. Rules are based on source and destination patterns. The ping could be allowed with your desktop as the destination while being denied with your desktop as the source, but it doesn't make sense that the rule would change dynamically.
A firewall does explain why you can ssh into the host and ping your desktop from it while pings to the host itself go unanswered. Ping and SSH are two entirely different services: ping is ICMP, and ssh is a TCP service on port 22. A firewall would have to be aware of both.
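Since ICMP and TCP can be filtered independently, it can help to test the TCP side separately from ping. This is a minimal Python sketch of a TCP reachability check; the function name and defaults are my own, and a `True` result only says that a TCP connection to that port could be opened, nothing about ICMP.

```python
import socket

def tcp_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks alike.
        return False

if __name__ == "__main__":
    import sys
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    print(host, "port 22 reachable:", tcp_port_open(host))
```

If this returns `True` at the same moment that pings to the same host go unanswered, the two protocols are being treated differently somewhere along the path or on the host itself.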
If your network guys are NATing source/destination addresses anywhere in the mix, the return route/NAT may be entirely different from the original destination path to the host, and the firewall rules would have to be defined differently based on the NAT. In other words, there is a one-way NAT to the host and the return is not being NATed. Or there could be a one-way service-level NAT and no return. Just a few thoughts.
Maybe it is worth investigating ARP flux
I had never heard of "arp flux". I have glanced over that page you posted, and haven't read it in detail yet. I will do so after I post this.
This page says arp flux "typically" affects only servers with multiple NICs on one segment. My situation is this:
I'm getting the problem on my 2-node test cluster. Each of those nodes has ONE network cable plugged into the LAN.
I'm getting the problem on my 3-node prod cluster. Each of those nodes has one NIC for the management interface, and one NIC for the VMs to use. Both are plugged into the same LAN.
I'm getting the problem on three different Oracle Linux VMs that are running on VMware ESX. Each VM has ONE virtual NIC. It is true, however, that the ESX hosts have two NICs plugged into the LAN. But only SOME of the Oracle Linux VMs have the problem. All of them are running some version of the UEK2 kernel.
So, I'm doubtful that arp flux is the problem, but I'll go read the page you suggested. I will reply again, if I find out any more.
I believe (can't be sure) I now know the cause of this problem: It is a UEK2 kernel bug that is still not fixed, as of version 2.6.39-300.17.1.el5uek.
My postings on this forum and elsewhere had the good effect that a guy from Denmark, who is running SUSE Linux, emailed me. He had the SAME PROBLEM. After we exchanged information, he opened a ticket with his SUSE support folks at Novell, and they said:
A change went into the kernel in version 3.0.<something> that drastically changed how route caching is done. This caused a number of problems, and it was reverted in a later patch. Here is a link to the kernel networking mailing list about the patch that reverts the original:
Apparently, UEK2 was forked from the mainline kernel (at 3.0.16, or so) after the trouble was introduced, and it's STILL THERE. (And I have never seen the problem in the original RHEL 2.6.18 kernel, nor in the UEK 2.6.32 kernel.)
The symptom manifests itself when your host is on a network with more than one gateway, and the default gateway redirects traffic to the proper gateway. The kernel change causes Linux to not properly handle the redirection. That's exactly the situation at my site, and also for the guy in Denmark who I've been talking with. One workaround is to install static routes in your server, so the router redirects never occur. I have done this, and it's too early to tell for sure, but it looks like the problem may be fixed.
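As a rough illustration of the static-route workaround, here is a Python sketch that turns a list of (network, gateway) pairs into `ip route add` commands. It only builds the command strings and does not run them. The function name is made up, and the example network and gateway are placeholders borrowed from a routing table posted further down this thread, not my real ones. Routes added this way do not survive a reboot, so they would still need to go into the distribution's network configuration.

```python
import ipaddress

def static_route_commands(routes, dev=None):
    """Build `ip route add NET via GW [dev DEV]` strings from (network, gateway) pairs.

    Raises ValueError if a network or gateway is not a valid address.
    """
    commands = []
    for network, gateway in routes:
        net = ipaddress.ip_network(network)   # rejects e.g. 172.16.10.5/24 (host bits set)
        gw = ipaddress.ip_address(gateway)
        cmd = f"ip route add {net} via {gw}"
        if dev:
            cmd += f" dev {dev}"
        commands.append(cmd)
    return commands

if __name__ == "__main__":
    # Placeholder values for illustration only.
    for line in static_route_commands([("172.16.10.0/24", "10.72.47.1")], dev="eth0"):
        print(line)
```

The idea is simply that every remote network gets an explicit next hop, so the server never sends a packet to the default gateway that would need to be redirected.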
I am not asserting as fact that the above is an accurate description of what's going on, but I have lots of circumstantial evidence that it IS.
I'll let you know when I know more.
Thank you Terry for all of the information.
A simple test is to boot using the non-UEK (aka Red Hat) kernel to isolate this grief to UEK.
Update "/boot/grub/grub.conf" and/or select non-UEK from GRUB during boot.
There are some features unavailable when using the RedHat kernel e.g. OCFS,
OL6u1 non-UEK kernel OCFS2 omitted
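After rebooting, it is worth confirming which kernel actually came up before drawing conclusions. A minimal Python sketch, assuming UEK release strings carry a "uek" suffix the way the version quoted earlier in this thread (2.6.39-300.17.1.el5uek) does; the function name and the non-UEK example string are hypothetical.

```python
import platform

def is_uek(release=None):
    """Return True if the kernel release string looks like a UEK build.

    Assumption: UEK releases contain "uek" (e.g. 2.6.39-300.17.1.el5uek).
    With no argument, the running kernel's release (same as `uname -r`) is checked.
    """
    if release is None:
        release = platform.release()
    return "uek" in release.lower()

if __name__ == "__main__":
    print(platform.release(), "-> UEK" if is_uek() else "-> not UEK")
```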
Oh, I've done lots of testing with the older UEK kernel, which is 2.6.32. I have never seen the error there, nor on the old, old RHEL kernel. Nor on brand new compiled 3.5 and 3.6 kernels. I see it ONLY on UEK2.
The catch is that this problem happens only once in a while. It may hit me every day for a week, and then I never see it again for two weeks. So I can't be 100% certain it's UEK2.
But it is.
FYI check out Mukesh Rathor Xen Summit presentation "PVH: PV Guest in HVM container",
Excerpt from ~14 minutes into the presentation,
"push the car up the hill and see if we can reproduce the problem" ;)
I've never seen it. I can see how route caching could cause this, but your routes would have to be changing.
You have to ask yourself: "Why are my routes changing?" Having multiple routers on the same network segment does not change which route is cached. The host will always take the same route. For example, say you have a host at 10.0.0.10, and that host forwards a packet destined for 10.0.1.10 to its default gateway of 10.0.0.1, which in turn forwards the packet to 10.0.0.2 to find the correct path to 10.0.1.0. The return path may change, in that 10.0.0.2 would NOT return the packet through 10.0.0.1 (because 10.0.0.1 and 10.0.0.2 are on the same subnet as 10.0.0.10), but that doesn't mean the route cache has changed. The return response is still through 10.0.0.1.
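The addresses in that example can be checked with a few lines of Python. This sketch assumes a /24 netmask, which the post does not state, and models only a host with a connected subnet and a single default gateway (no redirects, no static routes). The function name is made up.

```python
import ipaddress

def next_hop(src_iface, dest, default_gw):
    """Pick the next hop for `dest` on a host with one interface and one default gateway.

    src_iface is the host address with prefix length, e.g. "10.0.0.10/24" (the /24 is assumed).
    """
    iface = ipaddress.ip_interface(src_iface)
    dest_ip = ipaddress.ip_address(dest)
    if dest_ip in iface.network:
        return str(dest_ip)      # on-link: delivered directly, no router involved
    return default_gw            # off-link: always handed to the default gateway

if __name__ == "__main__":
    print(next_hop("10.0.0.10/24", "10.0.1.10", "10.0.0.1"))  # off-link destination
    print(next_hop("10.0.0.10/24", "10.0.0.2", "10.0.0.1"))   # second router is on-link
```

With only those two rules in play, the host's choice for 10.0.1.10 never varies, and 10.0.0.2 can reach 10.0.0.10 directly without passing back through 10.0.0.1.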
I have seen issues like this before when firewalls are involved and have IPS services enabled. In such a scenario, the firewall might see a "man in the middle" attack.
Why not set your default gateway to the router that actually has a direct path, without hopping?
It seems that we have the same problem here with OEL 6. I searched MOS but could not find anything about it. Does anybody have anything new on this issue?
We can absolutely reproduce it with UEK vs. RedHat Kernel.
I'm the original poster. This is STILL a problem on all our UEK2 machines. Even the newest kernel RPMs don't fix it. The problem, for us, occurs ONLY when the local network has one or more gateways in addition to the default gateway. We are working around the problem by installing static routes for all networks we need to get to. I can't understand why so few of us are seeing this problem.
Never had an issue with it and I have multiple gateways on the same local subnet.
Are you running multiple nics on the same subnet with more than one gateway defined?
Nope. I get the problem with one NIC, one subnet, if any traffic goes through the non-default gateway. Since almost (but not quite) nobody has this problem except me, I wonder if the default gateways on the local subnet are not sending proper redirect messages when the server sends the DG a packet that is destined for a subnet NOT behind the DG. Or some such thing. In any case, when I install static routes for ALL packets I need to send, so that my server always sends packets to the proper gateway, I never get the problem.
Thank you for your reply.
We run into the problem even with NO default gateway. Or did we misunderstand something?
[email@example.com ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:50:56:98:58:7c brd ff:ff:ff:ff:ff:ff
    inet 10.72.47.169/24 brd 10.72.47.255 scope global eth0
    inet6 fe80::250:56ff:fe98:587c/64 scope link
       valid_lft forever preferred_lft forever
[firstname.lastname@example.org ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.72.47.0      0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.16.10.0     10.72.47.1      255.255.255.0   UG    0      0        0 eth0
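To read a table like this programmatically, here is a short Python sketch that parses `route -n` text and does a longest-prefix lookup. The table embedded in it is copied from the output above; the function names are made up, and the lookup is a simplification that ignores metrics and any cached or redirect-learned routes the kernel may also hold.

```python
import ipaddress

ROUTE_N = """\
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.72.47.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 1002 0 0 eth0
172.16.10.0 10.72.47.1 255.255.255.0 UG 0 0 0 eth0
"""

def parse_routes(text):
    """Parse `route -n` text into a list of (network, gateway, iface) tuples."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[0] == "Destination":
            continue  # skip the title and header lines
        network = ipaddress.ip_network(f"{fields[0]}/{fields[2]}")
        routes.append((network, fields[1], fields[7]))
    return routes

def lookup(routes, dest):
    """Return the most specific matching route for `dest`, or None if nothing matches."""
    ip = ipaddress.ip_address(dest)
    matches = [route for route in routes if ip in route[0]]
    if not matches:
        return None
    return max(matches, key=lambda route: route[0].prefixlen)

def has_default_route(routes):
    """True if the table contains a 0.0.0.0/0 entry."""
    return any(route[0].prefixlen == 0 for route in routes)

if __name__ == "__main__":
    table = parse_routes(ROUTE_N)
    print("default route present:", has_default_route(table))
    print("172.16.10.5 ->", lookup(table, "172.16.10.5"))
```

On this table there is no 0.0.0.0/0 entry, and anything in 172.16.10.0/24 goes via 10.72.47.1, which matches the "no default gateway, one static route" description.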
Do you know if there has been any ticket opened on My Oracle Support for this issue?