16 Replies Latest reply: Mar 1, 2012 7:22 AM by Amaury

    OVM 3.0.3 - sudden network death

    421043
      Hi all,

      We have set up a test environment with OVM 3.0.3.
      So far everything works fine, except for one big show-stopper:
      the OVM servers fence out of the cluster at random, without any pattern, due to a network issue.


      Short explanation of the setup:

      Storage: HP EVA 6500 via FC

      4 OVM 3.0.3 Servers:
      - HP BL460c G7 each with 2xSixCore MT and 48GB RAM
      ==> disabled the VT-d BIOS option after the installation of the servers
      ==> No manual changes to the dom0 config so far

      - each Server with QLogic QMH2562 8Gb FC HBA

      - 6 NICs per Server:

      - each Blade with 10 GbE NC553i FlexFabric, 2 Ports (eth0/eth1)
      lspci: Emulex Corporation OneConnect 10Gb NIC (be3) (rev 01)
      ethtool -i: driver: be2net , version: 2.103.298r , firmware-version: 3.102.517.701

      - Server1 and Server2 with additional NC364m Quad Port 1Gb NIC (eth2 to eth5)
      lspci: Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter
      ethtool -i: driver: e1000e , version: 1.2.20-k2 , firmware-version: 5.12-6

      - Server3 and Server4 with additional NC325m Quad Port 1Gb NIC (eth2 to eth5)
      lspci: Broadcom Corporation NetXtreme BCM5715S Gigabit Ethernet
      ethtool -i: driver: tg3 , version: 3.113 , firmware-version: 5715s-v3.28

      - Network Configuration:
      bond0 - eth4/eth5 - 2x1Gb active-backup : Management, LiveMigration, Cluster Heartbeat (Network not routed)
      bond1 - eth1/eth3 - 10Gb/1Gb active-backup : OVM Guest Networks (4 VLANs)
      bond2 - eth0/eth2 - 10Gb/1Gb active-backup : reserved for SAN access via iSCSI (currently not in use / network not routed)


      - 1 Serverpool with 2 Repositories on FC and several FC LUNs for direct Guest Access



      Problem:
      The network “freezes” regularly (we didn’t detect any obvious cause or pattern). The affected server is then gracefully rebooted due to the lost connection to the cluster.
      All 4 servers are affected.


      Findings so far:

      In the collection below, "Server2" rebooted.
      Server1 is representative of the surviving nodes.


      The usual messages for a lost connection can be found:

      Server2 kernel: [40043.562729] o2net: Connection to node "Server1" (num 0) at 10.35.152.101:7777 has been idle for 60.5 secs, shutting it down.

      Server1 kernel: [39162.054394] o2net: Connection to node "Server2" (num 1) at 10.35.152.102:7777 shutdown, state 8
      Server1 kernel: [39162.054429] o2net: No longer connected to node "Server2" (num 1) at 10.35.152.102:7777

      Nothing explains why the connection has been lost.

      Sometimes we see ocfs2 exceptions: “ocfs2: Unaligned AIO/DIO on inode <inode> on device <device> by loop1”. But we dismissed this after reading the thread "Poolfs corruption".

      This is the only information we get out of the logs.
      No other useful information at all in ovm-consoled.log, ovs-agent.log, the xm log, dmesg (with "dynamic debugging" enabled), etc.



      Additional Information collected via background script:


      - O2CB heartbeat to disk is active and working fine for all 4 Servers (until the graceful reboot)

      - Link Status for all physical interfaces and the bonds is UP

      - No errors or corruptions reported in the switch logs

      - bond0 status is OK with TX/RX traffic until the reboot, no failover, 2 packets lost (rx_no_buffer_count: 2)
      The RX/TX count of the active NIC is in line with the RX/TX count of the bond interface;
      however, we see an increase of the RX count on the inactive backup interface eth5 before the reboot.

      - connection test via 'nc' (tcp:7777) during the crash
      Server1 --> Server2: "No route to host"
      Server2 --> Server1: "Connection timed out"
      Server2 --> Server2: "succeeded!"

      - No changes/corruptions on the routing tables

      - Netstat shows the following for the o2net session from Server2 to Server1:

      on Server1 : changes from ESTABLISHED to LAST_ACK
      on Server2 : changes from ESTABLISHED to FIN_WAIT1

      But again, even pinging bond0 on Server2 from an external host is not possible.

      - The ARP tables on Server1 and Server2 remain valid for around 30-50 sec after the connection issue occurs

      - There is almost no load (0.9) on Server2 when the connections are lost; afterwards the load increases (4-5) on Server1 and Server2 (until the reboot)

      - netstat -s on Server2 reports shortly before reboot
      20 failed connection attempts
      84 other TCP timeouts
      1 connections reset due to unexpected data
      2 connections reset due to early user close
      4 connections aborted due to timeout


      - In addition, we noticed in the NIC statistics of the active but unused eth0:

      rx_address_match_errors: 123156


      - General Kernel Parameters:

      sysctl -a | egrep 'net.core|panic' | grep -v " = 0"
      kernel.panic = 60
      kernel.panic_on_oops = 1
      net.core.somaxconn = 128
      net.core.xfrm_aevent_etime = 10
      net.core.xfrm_aevent_rseqth = 2
      net.core.xfrm_larval_drop = 1
      net.core.xfrm_acq_expires = 30
      net.core.wmem_max = 131071
      net.core.rmem_max = 131071
      net.core.wmem_default = 118784
      net.core.rmem_default = 118784
      net.core.dev_weight = 64
      net.core.netdev_max_backlog = 1000
      net.core.message_cost = 5
      net.core.message_burst = 10
      net.core.optmem_max = 20480
      net.core.netdev_budget = 300
      net.core.warnings = 1

      - default bond setup:
      grep bond0 /etc/modprobe.conf
      alias bond0 bonding
      options bond0 mode=1 miimon=250 use_carrier=1 updelay=500 downdelay=500
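      For readers reproducing this: the live state of such an active-backup bond (current active slave, per-slave link status) is exposed in /proc/net/bonding/bond0. As a self-contained sketch, the snippet below parses a sample of that output; the sample text is illustrative, not captured from our servers.

```shell
# Parse an active-backup bond's state as /proc/net/bonding/bond0 reports it.
# The sample below is illustrative; on a live system you would read the real
# file instead:  cat /proc/net/bonding/bond0
sample='Ethernet Channel Bonding Driver: v3.4.0

Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth4
MII Status: up

Slave Interface: eth4
MII Status: up

Slave Interface: eth5
MII Status: up'

# Which slave currently carries the traffic?
active=$(printf '%s\n' "$sample" | awk -F': ' '/Currently Active Slave/ {print $2}')
echo "active slave: $active"

# Per-slave link state (a failover would show up as a changed active slave).
links=$(printf '%s\n' "$sample" | awk -F': ' '
  /^Slave Interface/ {slave=$2}
  /^MII Status/ && slave {print slave " link " $2; slave=""}')
echo "$links"
```

      With miimon=250 and updelay/downdelay=500 as above, a link flap should be visible here as a changed "Currently Active Slave" after roughly half a second.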


      In the meantime we installed tcpdump on all Servers, which will be triggered during the next "outage".
      Hopefully we get some more hints.
      We’ll also upgrade the firmware of the 10 GbE NC553i FlexFabric NICs.

      It may well be that this is not only an issue with the management bond, but the connection loss on bond0 is at least what triggers the reboot.



      Main questions at the moment:

      Are we searching in the right direction?
      What is going on with the network?
      How to further debug?

      We are somewhat stuck, as we can't reproduce the problem.
      Maybe somebody has faced similar issues already and has an idea, hint or solution.

      Any feedback is more than welcome.

      Thanks,
      Claudius
        • 1. Re: OVM 3.0.3 - sudden network death
          Avi Miller-Oracle
          Claudius wrote:
          OVM Servers fencing out of the cluster on random basis without any pattern due to network issue.
          Log an SR with Oracle Support: you may have to enable a netconsole server so that we can get messages when the networking fails.
          • 2. Re: OVM 3.0.3 - sudden network death
            421043
            Hi Avi,

            Thanks for your input.
            Unfortunately we are still in an evaluation phase and the Systems are not yet under support – so logging an SR is currently not possible.

            But thanks for the hint with netconsole – we will set it up tomorrow.
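            For anyone following along, a minimal netconsole setup looks roughly like the sketch below; the IP addresses, interface and MAC are placeholders, and the authoritative procedure is the one in the Oracle note.

```shell
# Sketch with placeholder addresses - adjust to your environment.
# Parameter format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-mac]
modprobe netconsole netconsole=@10.35.152.102/eth4,6666@10.35.152.1/00:11:22:33:44:55

# On the receiving host, capture the streamed kernel messages, e.g.:
#   nc -u -l 6666 | tee netconsole.log
```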

            In addition we now upgraded the firmware of the 10 GbE NC553i FlexFabric to version 4.0.360.15a (21 Jan 2012), which includes interesting bugfixes.

            We also removed eth5 from bond0 and configured it on all servers directly with an IP to check if the issue is only bond related or not.

            Hopefully we have some useful information next time.

            Thanks,
            Claudius
            • 3. Re: OVM 3.0.3 - sudden network death
              Avi Miller-Oracle
              Claudius wrote:
              We also removed eth5 from bond0 and configured it on all servers directly with an IP to check if the issue is only bond related or not.
              I know there were some issues with balance-alb (mode 6) bonding, but you're using active-passive so that's probably not the issue. But, it would be interesting to see if removing the bonds assists.

              Also, make sure that you don't have multiple NICs on the same network, as you may get arp flux over the bonds. Each bond pair should be on a different network or VLAN.
              • 4. Re: OVM 3.0.3 - sudden network death
                Ronen Kofman
                Hello Claudius

                Please check whether the power management BIOS settings allow the CPU to go into a C-state deeper than C1; if they do, please set the max C-state to C1.
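                In addition to the BIOS, the C-state cap can often be enforced from the kernel command line. This is only a sketch: the paths and kernel line are placeholders, intel_idle.max_cstate applies only to kernels that ship the intel_idle driver, and on a Xen dom0 the Linux kernel is typically a `module` line under `kernel /xen.gz`.

```shell
# /boot/grub/grub.conf fragment (placeholders - adapt to the installed kernel).
# processor.max_cstate covers the older ACPI idle path;
# intel_idle.max_cstate applies only if the intel_idle driver is present.
kernel /vmlinuz ro root=/dev/sda2 intel_idle.max_cstate=1 processor.max_cstate=1
```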

                Let me know if this has solved the issue
                • 5. Re: OVM 3.0.3 - sudden network death
                  421043
                  Hi all,

                  First of all, thanks for all your input.

                  The Firmware upgrade of the 10 GbE NC553i FlexFabric did not solve the problem.
                  2 different Servers fenced out of the cluster afterwards.


                  But we got some more information overnight:

                  Only bond0 is affected.
                  bond2 works as expected, as does the directly configured (un-bonded) eth5.


                  This time we noticed the following:

                  ifconfig shows bond0 and eth4 as kind of "frozen" - meaning no changes in the statistics from the connection loss until the reboot:


                  bond0:
                  RX packets:2782276 errors:0 dropped:0 overruns:0 frame:0
                  TX packets:2964972 errors:0 dropped:0 overruns:0 carrier:0

                  eth4:
                  RX packets:2782276 errors:0 dropped:0 overruns:0 frame:0
                  TX packets:2964972 errors:0 dropped:0 overruns:0 carrier:0


                  However, ethtool -S eth4 reports the following:

                  first statistic we captured:

                  rx_packets: 2782345
                  tx_packets: 2964989
                  rx_no_buffer_count: 0

                  last statistic before the reboot:

                  rx_packets: 2782682
                  tx_packets: 2965087
                  rx_no_buffer_count: 154


                  This means there is still traffic on the physical layer.
                  In addition, we noticed an increase of rx_address_match_errors on eth0 by 164.
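                  A simple way to spot which counters still move while ifconfig looks frozen is to diff two `ethtool -S` snapshots. The snippet below does that with inlined sample snapshots modeled on the eth4 values above; on a live system you would capture the files from the NIC instead.

```shell
# Diff two `ethtool -S <nic>` snapshots and print only the counters that moved.
# Samples are inlined here; live capture would be e.g.:
#   ethtool -S eth4 > snap1; sleep 60; ethtool -S eth4 > snap2
cat > snap1 <<'EOF'
     rx_packets: 2782345
     tx_packets: 2964989
     rx_no_buffer_count: 0
EOF
cat > snap2 <<'EOF'
     rx_packets: 2782682
     tx_packets: 2965087
     rx_no_buffer_count: 154
EOF

# awk: remember the first snapshot, then report any counter whose value changed.
diff_out=$(awk -F': ' '
  NR==FNR { gsub(/^ +/, "", $1); old[$1]=$2; next }
  { gsub(/^ +/, "", $1);
    if ($1 in old && $2 != old[$1])
      print $1 ": " old[$1] " -> " $2 " (+" $2-old[$1] ")" }
' snap1 snap2)
echo "$diff_out"
```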


                  In the sniffer logs for bond0 we can see frequent ARP requests by failing Server2.
                  We also can see our ping as "ICMP echo request" from Server2 to Server1.
                  Server1 receives the "ICMP echo request" and correctly replies.
                  But Server2 never receives the reply.


                  We also checked the C-states on all 4 Servers.


                  3 of them had the following settings:

                  Idle Power Core State: C-6
                  Idle Power Package State: C-3

                  Now we have fully disabled c-states on all 4 Servers.

                  We also came across the following article: http://support.citrix.com/article/CTX129551
                  But this seems to be no option, as the server immediately reboots when the be2iscsi module is unloaded.


                  The C-states may have caused the problems.
                  But one of the servers already had C-states disabled and was affected nevertheless.

                  Will further monitor the systems and keep you updated.

                  Thanks,
                  Claudius
                  • 6. Re: OVM 3.0.3 - sudden network death
                    Avi Miller-Oracle
                    Claudius wrote:
                    In the sniffer logs for bond0 we can see frequent ARP requests by failing Server2.
                    We also can see our ping as "ICMP echo request" from Server2 to Server1.
                    Server1 receives the "ICMP echo request" and correctly replies.
                    But Server2 never receives the reply.
                    Are you perhaps suffering from arp flux: http://linux-ip.net/html/ether-arp.html (scroll to item 2.1.4)? This could also be your switching environment not forwarding ARP responses as a way of "protecting" against flux.
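                    If it does turn out to be ARP flux, the host-side mitigation usually cited is the arp_ignore/arp_announce sysctls. These are standard kernel knobs, but whether they are appropriate on an OVM dom0 should be verified before rolling them out.

```shell
# arp_ignore=1   - only answer ARP if the target IP is configured on the
#                  interface the request arrived on
# arp_announce=2 - use the best local source address when sending ARP requests
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2

# Persist across reboots by adding the same keys to /etc/sysctl.conf:
#   net.ipv4.conf.all.arp_ignore = 1
#   net.ipv4.conf.all.arp_announce = 2
```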
                    • 7. Re: OVM 3.0.3 - sudden network death
                      421043
                      Hi Avi,

                      Thumbs up - you might have hit the target... but we're not 100% sure.
                      Looks like we have ARP flux on eth4 and eth5 (at least when eth5 is off bond0)!

                      Quick explanation:

                      eth4 of Server1 AND Server2 is connected to core-switch A
                      eth5 of Server1 AND Server2 is connected to core-switch B
                      ==> same broadcast domain
                      Switch A and B have a trunk.


                      While bond0 contains only eth4, and eth5 has a dedicated IP assigned, we see the ARP flux:

                      Server2 $ arping -I bond0 Server1
                      ARPING 10.35.152.101 from 10.35.152.102 bond0
                      Unicast reply from 10.35.152.101 [00:1B:78:78:77:1E] 0.756ms << correct
                      Unicast reply from 10.35.152.101 [00:1B:78:78:77:1F] 0.786ms << wrong
                      Unicast reply from 10.35.152.101 [00:1B:78:78:77:1F] 0.750ms


                      Server1 $ watch -n 1 -d 'ifconfig bond0 ; ifconfig eth4 ; ifconfig eth5'

                      bond0
                      Link encap:Ethernet HWaddr 00:1B:78:78:77:1E
                      inet addr:10.35.152.101 Bcast:10.35.152.255 Mask:255.255.255.0
                      UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
                      RX packets:203244 errors:0 dropped:0 overruns:0 frame:0
                      TX packets:230105 errors:0 dropped:0 overruns:0 carrier:0
                      collisions:0 txqueuelen:0


                      eth4
                      Link encap:Ethernet HWaddr 00:1B:78:78:77:1E
                      UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
                      RX packets:203244 errors:0 dropped:0 overruns:0 frame:0
                      TX packets:230105 errors:0 dropped:0 overruns:0 carrier:0
                      collisions:0 txqueuelen:1000


                      eth5
                      Link encap:Ethernet HWaddr 00:1B:78:78:77:1F
                      inet addr:10.1.1.1 Bcast:10.1.1.255 Mask:255.255.255.0
                      UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
                      RX packets:4600 errors:0 dropped:0 overruns:0 frame:0
                      TX packets:151 errors:0 dropped:0 overruns:0 carrier:0
                      collisions:0 txqueuelen:1000


                      However, if we bring eth5 again into bond0, the arp flux is gone.

                      bond0 has been configured (automatically by OVM Manager) with mode=1
                      ifconfig shows MAC of bond0 == MAC of eth4 == MAC of eth5
                      NOARP flag not set on the passive interface.

                      We are not sure, but the backup interface might be replying to ARP requests.
                      If so, we would totally confuse the two switches, which would then behave as in an active-active setup and also send traffic to the passive interface.
                      We carefully checked the results collected during the crashes of the past days.
                      So far we have not found any hint of ARP flux while bond0 has 2 enslaved NICs.
                      The backup interface always has RX packets (e.g. broadcasts), but never TX packets (meaning NO wrong ARP reply, right?)

                      After the next outage we will have the sniffer results for all network interfaces, from shortly before the crash until the reboot of the failing server.
                      We also decided to stop our background scripts so as not to produce any "background noise".
                      Maybe something interesting happens beforehand.

                      Not sure if we are still on the right track, so once done we will step back and give netconsole a try.


                      Thanks,
                      Claudius
                      • 8. Re: OVM 3.0.3 - sudden network death
                        421043
                        Hi all,

                        quick question:
                        If I configure netconsole as described in Doc ID 1351524, do I get more output on the remote side than in the local dmesg?
                        Or can I set more debug options on the client side (OVM Server) as described in Doc ID 793684.1?



                        Currently I'm still trying to understand why we have an increasing rx_address_match_errors count on eth0.
                        This interface is the primary NIC of bond2 - configured for "Storage" in OVM Manager - but not in use at the moment.
                        So no traffic is expected.

                        Either the NIC is somehow involved after all, or something falls back to the "primary/first interface" for lack of another option...

                        Not sure if relevant, but just so as not to hide it:
                        Initially we started to set up repositories via iSCSI over bond2.
                        Due to unexpectedly poor performance on the 10GbE infrastructure, we switched to FC.
                        We rolled back everything as stated in the official documentation.
                        Finally we un-configured and deleted the iSCSI array in OVM Manager.
                        Later on we detected still-open connections from all OVM servers to the iSCSI target - even after a reboot.
                        We found out that deleting the array in OVM Manager did not clean up the clients (OVM servers).
                        So we manually cleaned up via:
                        iscsiadm -m node --logoutall=all
                        iscsiadm -m node --op=delete
                        'netstat -anp' no longer showed open sockets.

                        But maybe some surviving config is now making trouble..?

                        As stated already, not sure if this is relevant to the still open questions...

                        Thanks,
                        Claudius
                        • 9. Re: OVM 3.0.3 - sudden network death
                          Avi Miller-Oracle
                          Claudius wrote:
                          If I configure netconsole as described in Doc ID 1351524, do I get more output on the remote side than in local dmesg?
                          Or can I set more debug options on the client side (OVM Server) as described in Doc ID 793684.1?
                          It's best to open an SR with Oracle Support for this sort of question. I don't honestly know, I'm afraid.
                          • 10. Re: OVM 3.0.3 - sudden network death
                            421043
                            Hi all,

                            short update after the weekend.

                            Since we disabled the C-state settings on all servers, no server has fenced out of the cluster for 4 days!
                            We have now disabled our tcpdump background jobs to check whether the sniffing acted as some kind of "keep alive"...
                            If the servers stay stable, the C-states really caused the problem.

                            Will keep you updated to hopefully mark the thread as answered asap.

                            Thanks a lot,
                            Claudius
                            • 11. Re: OVM 3.0.3 - sudden network death
                              Amaury
                              Hi Claudius,

                              I think I'm hitting the same issue. Which type of servers do you use? HP ProLiant blades? Which kind of CPU?

                              I'm using ProLiant G6 blades with AMD Opterons and I'm not sure which power options I should deactivate.
                              • 12. Re: OVM 3.0.3 - sudden network death
                                421043
                                Hi Amaury,

                                We are using HP BL460c G7 blades, each with 2 Intel Xeon E5649 CPUs (multithreading enabled).
                                Please check the beginning of the initial post for our setup.
                                We disabled the C-states in the BIOS as suggested by Ronen - (so far) no more issues afterwards.
                                But I'm not sure if it's the same with AMD Opterons.

                                Please check out the 2 links below for more details.
                                http://www.hardwaresecrets.com/article/Everything-You-Need-to-Know-About-the-CPU-C-States-Power-Saving-Modes/611/1
                                http://support.citrix.com/article/CTX127395

                                Also ML Support Doc. ID 1312709.1 gives some hints.

                                Claudius
                                • 13. Re: OVM 3.0.3 - sudden network death
                                  421043
                                  Hi all,

                                  I have marked the thread as answered.
                                  The systems have been stable, up and running, for more than a week now.
                                  The only thing we really changed was the C-states, as suggested by Ronen.
                                  Thanks for the hint - it would have been quite confusing troubleshooting for some more days otherwise...

                                  Hope this thread is useful for others, too.

                                  Special Thanks to Ronen and Avi for their contribution.

                                  Claudius
                                  • 14. Re: OVM 3.0.3 - sudden network death
                                    Ronen Kofman
                                    You are most welcome.