41 Replies. Latest reply: Feb 27, 2014 9:25 AM by user12273962

ALL FIVE of my VM servers have the same, very strange network problem

Terry Phelps Newbie
I can fairly easily re-create this. I am getting it on one OEL 5.x server, also, so it's not unique to Oracle VM. This is really a Linux problem, as best I can tell.

I've been fighting it for a while, and I've just gotten a network guy to help me isolate it, and we think it "just has to be" a Linux problem. The symptom is:

From my desktop, I SSH to any one of the five Oracle VM servers.
I exit SSH.
I try to SSH into it again.
It fails.
I ping the server from my desktop.
The pings fail.
If I ping that server from my desktop continuously, it will NEVER get a ping reply. I've let it run for DAYS, and no answer.
If I STOP pinging for 10 minutes, and then ping it again, ALL IS WELL.

During the time the VM server is refusing to answer my pings, all other networking seems fine. Specifically, I can SSH to it from elsewhere, and from the server I can ping my desktop, and that works perfectly!

My network guy and I traced all traffic into and out of the server (remember, all 5 of them do it), and we see the ICMP requests getting to the server, but no ICMP response coming out.

If anyone has any idea why this might happen, or what I can do to isolate it further, or how to make it stop, please let me know.

I have posted this question in other Linux forums, and I get a slew of standard advice to check my switches, make sure the NICs are configured properly, make sure the ARP cache isn't bad, and such things. None of that seems related to why VM Server stops answering pings from ONE machine and refuses to ever answer them again until the pings stop for a while, after which it's all good again.

Update: I should have posted one additional piece of evidence. I can't run traceroute from Oracle VM, because there is no traceroute installed. But I CAN test with traceroute on the one Linux box that's having the SAME symptom as my Oracle VM servers. What I see is that ping works and traceroute fails, when going to the same destination. Watch:

# ping 192.168.118.22
PING 192.168.118.22 (192.168.118.22) 56(84) bytes of data.
64 bytes from 192.168.118.22: icmp_seq=1 ttl=127 time=0.546 ms
64 bytes from 192.168.118.22: icmp_seq=2 ttl=127 time=0.430 ms

# traceroute 192.168.118.22
traceroute to 192.168.118.22 (192.168.118.22), 30 hops max, 40 byte packets
send: Network is down

Network is DOWN?? I just pinged it! Furthermore, I can traceroute to another box on the very same destination network:

# traceroute 192.168.118.23
traceroute to 192.168.118.23 (192.168.118.23), 30 hops max, 40 byte packets
1 f200.acbl.net (172.16.0.5) 0.525 ms 0.486 ms 0.465 ms
2 172.16.16.253 (172.16.16.253) 3.045 ms 3.428 ms 3.682 ms
3 192.168.118.23 (192.168.118.23) 0.469 ms 0.513 ms 0.508 ms

Insane! And if I trace the traceroute, it really is getting ENETDOWN back from its sendto() call:

connect(3, {sa_family=AF_INET, sin_port=htons(33434), sin_addr=inet_addr("192.168.118.22")}, 28) = 0
sendto(3, "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"..., 40, 0, NULL, 0) = -1 ENETDOWN (Network is down)
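
For anyone who wants to repeat that connect()/sendto() pair on a box with no traceroute installed (as on Oracle VM), here is a minimal Python sketch. It assumes Python 3 is available on the host, and the names `udp_probe` and `errno_name` are made up for illustration. It only mimics the single UDP probe seen in the trace above (port 33434, 40-byte payload); it is not a traceroute replacement.

```python
import errno
import socket


def errno_name(exc):
    """Map an OSError to its symbolic errno name, e.g. 'ENETDOWN'."""
    return errno.errorcode.get(exc.errno, str(exc.errno))


def udp_probe(dest, port=33434, payload_len=40):
    """Send one UDP datagram the way the traced traceroute does.

    Returns None if the datagram was handed to the kernel without error,
    otherwise the symbolic errno name of the failure.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            # connect() on a UDP socket only sets the default peer;
            # no packet is sent yet.
            sock.connect((dest, port))
            # send() on the connected socket is the step that returned
            # ENETDOWN in the trace.
            sock.send(b"@" * payload_len)
        return None
    except OSError as exc:
        return errno_name(exc)


if __name__ == "__main__":
    # The two destinations from the traceroute runs above.
    for host in ("192.168.118.22", "192.168.118.23"):
        print(host, udp_probe(host) or "sent OK")
```

Running this while one destination is "hosed" and the other is not should show whether the ENETDOWN comes from the kernel's send path itself rather than from anything traceroute-specific.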

Edited by: Terry Phelps on Jul 11, 2012 2:03 PM
  • 1. Re: ALL FIVE of my VM servers have the same, very strange network problem
    budachst Pro
    The first thing I would think of is a driver issue. I have never experienced this in such a way. I have come across situations where the ARP cache of my switch(es) would go wild and cause all kinds of weird connection troubles, similar to the ones you are describing, but it cannot possibly be a "Linux problem" in general.

    I'd rather suspect a BIOS or driver/firmware issue. I remember having some strange networking issues on some of my Dell servers, which are nowadays equipped with Broadcom chips (as most commodity servers are). The firmware installed by Dell led to a slew of issues, in fact for three incarnations of the firmware in a row, until Dell decided to downgrade the NIC firmware to some former version (not without actually increasing the firmware version number, though - funny guys!)
  • 2. Re: ALL FIVE of my VM servers have the same, very strange network problem
    user12273962 Pro
    Does the OEL VM have multiple virtual NICs, and more than one default route defined? In other words, do you have the VM connected to different VLAN subnets with a default route defined on both networks? If you do, and you're using adaptive routing, then you may see crazy things like this happen.

    Also, can you ping the default gateway on the VM server's subnet from your desktop when the server is refusing to respond to the pings?
  • 3. Re: ALL FIVE of my VM servers have the same, very strange network problem
    Terry Phelps Newbie
    No, the OEL VM has one virtual NIC, therefore only one default route. And this same symptom happens on physical servers with only one connection to my LAN.

    Yes, I can ping my default router and any other machine on the same network as my Oracle VM server, when it won't answer pings from me.

    The strange thing, remember, is that it stops talking, and won't talk again until after a period of no traffic at all from the other device. THEN, it's all better again.
  • 4. Re: ALL FIVE of my VM servers have the same, very strange network problem
    user12273962 Pro
    Are you using any type of link aggregation or bonding, or an MTU over 1500?

    On your VM server, if you run "ifconfig bond0" on the server that you can't ping from time to time, do you see any errors?
  • 5. Re: ALL FIVE of my VM servers have the same, very strange network problem
    929532 Newbie
    "And this same symptom happens on physical servers with only one connection to my LAN."

    Having said that I think we can rule out VM Server inconsistencies and need to take a hard look at:

    - Your switches. Autonegotiation could be going bad when talking to the NICs on the Linux systems, but if you also see collisions or CRC errors, you may have suspect cabling or a bad firmware rev on the switches.

    - The Linux build process you are using, to see if there is something systemic you are replicating across the servers you are building, like iptables inadvertently left switched on, or SELinux enabled and doing silly things behind the scenes. Status changes on interface links, full receive buffers on the NIC, or TCP windows going a bit nutty should bring up something informative in dmesg and /var/log/messages on these systems.
  • 6. Re: ALL FIVE of my VM servers have the same, very strange network problem
    Terry Phelps Newbie
    I have an update to this problem:

    1. This is NOT just an Oracle VM problem. I have one OEL 5 Linux server, NOT hosted on Oracle VM, that has the same problem.

    2. I have run tcpdump on the OEL 5 box that has the SAME problem as my Oracle VM servers. When the machine gets "hosed", it will not reply to pings from one machine on another network, yet it works perfectly with another machine on that same other network. tcpdump shows the ping's ICMP request packets arriving at the server. The server is simply NOT REPLYING. Question: What could possibly cause that?

    3. I have discovered that when it gets "hung" in this way, the entries in the route cache, as displayed by "netstat -nrC", for the machine that works fine and for the machine whose pings are ignored, look exactly the same to me. Yet, if I flush the route cache with "ip route flush cache", the problem instantly goes away.

    Any ideas what I should look for next?
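
One way to look further, since the two cache entries look alike side by side: capture the kernel's route lookup for the SAME peer once while things are healthy and once while it is hung, then diff the two captures. The sketch below is only a suggestion under assumptions: it needs Python 3 on the host, it uses `ip route get` (a different view than "netstat -nrC"), and the helper names `route_get` and `diff_lookups` are hypothetical.

```python
import difflib
import subprocess


def route_get(addr):
    """Return the text of the kernel's current route lookup for addr."""
    result = subprocess.run(
        ["ip", "route", "get", addr],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def diff_lookups(label_a, text_a, label_b, text_b):
    """Return unified-diff lines between two captured lookups.

    An empty list means the two captures are textually identical.
    """
    return list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=label_a,
        tofile=label_b,
        lineterm="",
    ))


if __name__ == "__main__":
    peer = "192.168.118.22"  # the destination that fails in the first post
    healthy = route_get(peer)
    input("Press Enter once the host is hung for this peer... ")
    hung = route_get(peer)
    for line in diff_lookups("healthy", healthy, "hung", hung):
        print(line)
```

If the diff comes back empty, that at least confirms the difference is not visible at this level and has to be somewhere the lookup output does not show.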
  • 7. Re: ALL FIVE of my VM servers have the same, very strange network problem
    user157995 Explorer
    Does a laptop or something exhibit the same behavior when plugged into the same switch, using the same IP address/mask/gateway as your OEL box?

    Your traceroute shows me you've got a router or something in between the machines you're testing from and the machine you're testing on. I would rule out that issue by plugging directly into the same physical switch as the OEL machine and redoing your testing.
  • 8. Re: ALL FIVE of my VM servers have the same, very strange network problem
    Terry Phelps Newbie
    The problem seems to be happening ONLY on OEL machines, and not on any Windows machine or VMware ESX host. And not all OEL machines exhibit the symptom. But ALL of my OVM 3.1.1 servers have the problem.

    Yes, there certainly is a router between VM Server and the machine having the problem. I have never seen the problem between an Oracle VM Server and another machine on the same network.

    But the questions are:
    What could possibly trigger this problem?
    What is wrong that is fixed by flushing the route cache?
    How can I make it stop?
  • 9. Re: ALL FIVE of my VM servers have the same, very strange network problem
    user12273962 Pro
    What Dave suggested is a good way to troubleshoot the issue.

    The only other thing I can think of is that it might be a NIC driver issue. What NIC are you using? Are you loading drivers for the NIC, or are they native to the kernel?
  • 10. Re: ALL FIVE of my VM servers have the same, very strange network problem
    user12273962 Pro
    You say "same network"..... are they on the "same physical switch"?

    Seems like you are convinced this is an issue with Oracle VM and OEL 5.

    One other thing I thought of: do you have a firewall in between you and the destination network? Are you accessing it over a VPN? I've seen conflicting rules on firewalls cause something similar, especially if that firewall is load balancing different paths to the destination network.

    Edited by: user12273962 on Jul 13, 2012 11:15 AM
  • 11. Re: ALL FIVE of my VM servers have the same, very strange network problem
    Terry Phelps Newbie
    When I say "same network", I mean both on "172.16.0.0/16". There are certainly multiple Cisco switches involved. I don't have details of the network topology.

    No, there is no firewall involved anywhere. The strangest symptom, I think, is: ICMP requests arrive at the OEL box (tcpdump shows them!), but NOTHING is sent in reply.

    Yes, it seems to be only Oracle Linux machines that exhibit this problem, and I've been running tests to OEL, Windows, ESX, and AIX machines for two weeks, and the problem shows up ONLY on OEL.
  • 12. Re: ALL FIVE of my VM servers have the same, very strange network problem
    user12273962 Pro
    Sorry I couldn't help. In 15 years, I've never seen this happen in any other scenarios.

    What model NIC are you using?
  • 13. Re: ALL FIVE of my VM servers have the same, very strange network problem
    user157995 Explorer
    It just occurred to me that you said if you stop pinging, in 10 minutes you lose everything. That's a classic case of your switches purging their MAC address tables. I would say it's time you talk to whoever runs your network and start investigating at the switching level to see if there is a NIC / port / MAC issue.
  • 14. Re: ALL FIVE of my VM servers have the same, very strange network problem
    Terry Phelps Newbie
    I gotta think this is a host problem, and not a switch problem, because:

    1. I've got a data center with many servers in it, Oracle Linux, Windows, AIX, and VMWare ESX. ONLY Oracle Linux shows the problem.
    2. When a host gets hosed, tcpdump shows that the ICMP requests from my desktop-1 ARE arriving, but no ICMP response is being sent. Why would it disregard the pings? No, there is no firewall involved anywhere.
    3. When the host is hosed, it is able to ping, AND get a response from, desktop-1, even though pings from desktop-1 are not being answered.
    4. When the host is hosed, and is ignoring pings (and other traffic, like TCP SYN packets) from desktop-1, the host (1) can ping desktop-1 (same network as desktop-1), but (2) traceroute to desktop-2 FAILS, saying "network is down". That is completely illogical to me. The ONE NIC in the box is certainly not down, because I'm SSHed into it, doing the pings to desktop-1 and -2.
    5. Something new: If I enter "ip route flush cache" on the host when it's hosed, the error clears up instantly. If I dump the route cache with "netstat -nrC", I don't see anything obviously bad. The only route defined in the host is its default route.

    I have questions open to the Linux kernel experts, and others, but have still not gotten any help to speak of.
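
To put numbers on the "requests arrive, no reply goes out" observation, a saved text capture can be scanned for echo requests that never got a matching reply. This is a rough sketch, assuming Python 3 and a capture taken with something like "tcpdump -n icmp" whose lines look like the usual "IP a > b: ICMP echo request, id N, seq M" form. The function name `unanswered_requests` and the host address 172.16.0.50 in the sample are hypothetical.

```python
import re

# Matches the src/dst, request-or-reply, id and seq fields of an ICMP echo line.
ECHO = re.compile(
    r"IP (\S+) > (\S+): ICMP echo (request|reply), id (\d+), seq (\d+)"
)


def unanswered_requests(lines):
    """Return (src, dst, id, seq) for each echo request with no matching reply.

    A reply matches a request when source and destination are swapped and
    the id and seq fields are equal.
    """
    pending = {}
    for line in lines:
        match = ECHO.search(line)
        if not match:
            continue  # not an ICMP echo line
        src, dst, kind, ident, seq = match.groups()
        if kind == "request":
            pending[(src, dst, ident, seq)] = line
        else:
            # A reply travels dst -> src relative to its request.
            pending.pop((dst, src, ident, seq), None)
    return list(pending)


if __name__ == "__main__":
    # Hypothetical capture lines, for illustration only.
    sample = [
        "12:00:01.000001 IP 192.168.118.22 > 172.16.0.50: ICMP echo request, id 1, seq 1, length 64",
        "12:00:01.000090 IP 172.16.0.50 > 192.168.118.22: ICMP echo reply, id 1, seq 1, length 64",
        "12:00:02.000001 IP 192.168.118.22 > 172.16.0.50: ICMP echo request, id 1, seq 2, length 64",
    ]
    for entry in unanswered_requests(sample):
        print("no reply for", entry)
```

Grouping the result by source address would show at a glance whether only the one "ignored" machine has unanswered requests while others on its network are fine.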