I can fairly easily re-create this. I am getting it on one OEL 5.x server, also, so it's not unique to Oracle VM. This is really a Linux problem, as best I can tell.
I've been fighting it for a while, and I've just gotten a network guy to help me isolate it, and we think it "just has to be" a Linux problem. The symptom is:
From my desktop, I SSH to one of the five Oracle VM servers (any one of them).
I exit SSH.
I try to SSH into it again.
I ping the server from my desktop.
The pings fail.
If I ping that server from my desktop continuously, it will NEVER get a ping reply. I've let it run for DAYS, and no answer.
If I STOP pinging for 10 minutes, and then ping it again, ALL IS WELL.
During the time the VM server is refusing to answer my pings, all other networking seems fine. Specifically, I can SSH to it from elsewhere, and it can ping my desktop perfectly!
My network guy and I traced all traffic into and out of the server (remember, all 5 of them do it), and we see the ICMP requests getting to the server, but no ICMP response coming out.
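For anyone who wants to repeat the capture, here is a rough sketch of what we did. The interface name and desktop IP are placeholders, and the counting helper is my own, not part of tcpdump:

```shell
# Hypothetical capture on the server side; eth0 and 192.168.0.10 are placeholders.
# -n: no DNS lookups, -l: line-buffered so the output can be piped/teed live.
#   tcpdump -nli eth0 icmp and host 192.168.0.10 | tee /tmp/icmp.log

# Helper to compare request vs. reply counts in saved tcpdump output:
count_icmp() {
  awk '/ICMP echo request/ {req++}
       /ICMP echo reply/   {rep++}
       END {printf "requests=%d replies=%d\n", req+0, rep+0}'
}
# Usage: count_icmp < /tmp/icmp.log
```

When the problem is active, the request count climbs while the reply count stays at 0, which matches what we saw on the wire.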
If anyone has any idea why this might happen, or what I can do to isolate it further, or how to make it stop, please let me know.
I have posted this question in other Linux forums, and I get a slew of standard advice: check my switches, make sure the NICs are configured properly, make sure the ARP cache isn't bad, and such things. None of that seems related to why VM Server stops answering pings from ONE machine, and refuses to ever answer them again until the pings stop for a while, after which it's all good again.
Update: I should have posted one additional piece of evidence. I can't run traceroute from Oracle VM, because traceroute isn't installed there. But I CAN test with traceroute on the one Linux box that has the SAME symptom as my Oracle VM servers. What I see is that ping works and traceroute fails, when going to the same destination. Watch:
# ping 192.168.118.22
PING 192.168.118.22 (192.168.118.22) 56(84) bytes of data.
64 bytes from 192.168.118.22: icmp_seq=1 ttl=127 time=0.546 ms
64 bytes from 192.168.118.22: icmp_seq=2 ttl=127 time=0.430 ms
# traceroute 192.168.118.22
traceroute to 192.168.118.22 (192.168.118.22), 30 hops max, 40 byte packets
send: Network is down
Network is DOWN?? I just pinged it! Furthermore, I can traceroute to another box on the very same destination network:
# traceroute 192.168.118.23
traceroute to 192.168.118.23 (192.168.118.23), 30 hops max, 40 byte packets
1 f200.acbl.net (172.16.0.5) 0.525 ms 0.486 ms 0.465 ms
2 172.16.16.253 (172.16.16.253) 3.045 ms 3.428 ms 3.682 ms
3 192.168.118.23 (192.168.118.23) 0.469 ms 0.513 ms 0.508 ms
Insane! And if I trace the traceroute, it really is getting ENETDOWN back from its sendto() call.
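In case anyone wants to see it themselves, here is a sketch of how to watch those syscalls, assuming strace is available on the box (the function wrapper is just for convenience, not something traceroute provides):

```shell
# Assumed sketch; requires strace and traceroute installed.
trace_sendto() {
  # $1 = destination IP; show only traceroute's send syscalls and their errno
  strace -f -e trace=sendto,sendmsg traceroute "$1" 2>&1 \
    | grep -E 'sendto|sendmsg'
}
# Example: trace_sendto 192.168.118.22
# On the broken box, the expectation is a send call returning
#   -1 ENETDOWN (Network is down)
```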
The first thing I would think of is a driver issue. I have never experienced this exact behavior myself. I have come across situations where the ARP cache of my switch(es) would go wild and cause all kinds of weird connection troubles, similar to the ones you are describing, but it can't possibly be a "Linux problem" in general.
I'd rather suspect a BIOS or NIC firmware issue. I remember having some strange networking issues on some of my Dell servers, which are nowadays equipped with Broadcom chips (as most commodity servers are), and the firmware installed by Dell led to a slew of issues - in fact for three incarnations of the firmware in a row, until Dell decided to downgrade the NIC firmware to a former version (not without actually increasing the firmware version number, though - funny guys!).
Does the OEL VM have multiple virtual NICs, and more than one default route defined? In other words, do you have the VM connected to different VLAN subnets, with a default route defined on both networks? If you do, and you're using adaptive routing, then you may see crazy things like this happen.
Also, can you ping the default gateway on the VM server's subnet from your desktop when the server is refusing to respond to the pings?
And this same symptom happens on physical servers with only one connection to my LAN.
Having said that, I think we can rule out VM Server inconsistencies and need to take a hard look at:
- Your switches: autonegotiation could be going bad when talking to the NICs on the Linux systems. If you also see collisions or CRC errors, you may have suspect cabling or a bad firmware rev on the switches.
- The Linux build process you are using, to see if there is something systemic you are replicating across the servers you build - like iptables inadvertently left switched on, or SELinux enabled and doing silly things behind the scenes. Status changes on interface links, full receive buffers on the NIC, or TCP windows going a bit nutty should show up as something informative in dmesg and /var/log/messages on these systems.
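If it helps, these are the quick one-liners I'd run for those checks, wrapped in a function for convenience (standard tools and paths assumed; eth0 is a placeholder, and most of these need root):

```shell
# Assumed standard tools; run as root on one of the affected boxes.
quick_checks() {
  iptables -L -n -v                       # any filter rules quietly dropping ICMP?
  getenforce                              # SELinux mode (Enforcing/Permissive/Disabled)
  dmesg | tail -n 50                      # recent link flaps or NIC driver complaints
  tail -n 50 /var/log/messages            # same, from syslog
  ethtool -S eth0 | grep -iE 'err|drop'   # NIC error/drop counters (eth0 assumed)
}
# Example: quick_checks 2>&1 | less
```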
1. This is NOT just an Oracle VM problem. I have one OEL 5 Linux server, NOT hosted on Oracle VM, that has the same problem.
2. I have run tcpdump on the OEL 5 box that has the SAME problem as my Oracle VM servers. When the machine gets "hosed" - refusing to reply to pings from one machine on another network, while working perfectly with a second machine on that same network - tcpdump shows the ping's ICMP request packets arriving at the server. The server is simply NOT REPLYING. Question: what could possibly cause that?
3. I have discovered that when it gets "hung" in this way, the entries in the route cache, as displayed by "netstat -nrC", look exactly the same to me for the machine that works fine and for the machine whose pings are ignored. Yet if I flush the route cache with "ip route flush cache", the problem instantly goes away.
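For anyone else hitting this, the inspect-then-flush sequence looks like the sketch below. The peer IP is a placeholder, and on these 2.6-era kernels "ip route show cache" reads the same kernel cache that "netstat -nrC" displays:

```shell
# 192.168.0.10 is a placeholder for the peer whose pings are being ignored.
inspect_and_flush() {
  ip route show cache "$1"   # the cached entry netstat -nrC also displays
  ip route flush cache       # root required; instantly clears the symptom here
}
# Example: inspect_and_flush 192.168.0.10
```

Capturing the cached entry for both the working peer and the ignored peer before the flush is the useful part - that's the pair I compared and found visually identical.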
Does a laptop or something exhibit the same behavior when plugged into the same switch, using the same IP address/mask/gateway as your OEL box?
Your traceroute shows me there is a router or something in between the machine you're testing from and the machine you're testing on. I would rule that out by plugging directly into the same physical switch as the OEL machine and redoing your tests.
You say "same network"..... are they on the "same physical switch"?
Seems like you are convinced this is an issue with Oracle VM and OEL 5.
One other thing I thought of: do you have a firewall in between you and the destination network? Are you accessing it over a VPN? I've seen conflicting rules on firewalls cause something similar, especially if that firewall is load-balancing different paths to the destination network.
It just occurred to me that you said if you stop pinging for 10 minutes, everything recovers. That's a classic case of your switches purging their MAC address tables. I would say it's time to talk to whoever runs your network and start investigating at the switching level to see if there is a NIC / port / MAC issue.
I gotta think this is a host problem, and not a switch problem, because:
1. I've got a data center with many servers in it, Oracle Linux, Windows, AIX, and VMWare ESX. ONLY Oracle Linux shows the problem.
2. When a host gets hosed, tcpdump shows that the ICMP requests from my desktop-1 ARE arriving, but no ICMP response is being sent. Why would it disregard the pings? No, there is no firewall involved anywhere.
3. When the host is hosed, it is able to ping, AND get a response from, desktop-1, even though pings from desktop-1 are not being answered.
4. When the host is hosed, and is ignoring pings (and other traffic, like TCP SYN packets) from desktop-1, the host (1) can ping desktop-1, but (2) traceroute to desktop-2 (on the same network as desktop-1) FAILS, saying "network is down". That is completely illogical to me. The ONE NIC in the box is certainly not down, because I'm SSHed into it, doing the pings to desktop-1 and -2.
5. Something new: If I enter "ip route flush cache" on the host when it's hosed, the error clears up instantly. If I dump the route cache with "netstat -nrC", I don't see anything obviously bad. The only route defined on the host is its default route.
I have questions open to the Linux kernel experts, and others, but have still not gotten any help to speak of.