9 Replies Latest reply: Nov 12, 2012 11:24 PM by 947524

    RAC node down, suspected network problem

    947524
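      Production system: 2-node RAC, RAC suite version 11.2.0.4, operating system RHEL 5.4 x86_64. The servers are HP DL380G7.

      The same problem has been occurring every month recently: the heartbeat (interconnect) network reports errors, then the database on node 2 goes down, and only a server reboot resolves it.

      Below are the error logs from the time of the failure (eth0 is the NIC that carries the heartbeat network).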

      *##### messages #####*
      Nov 9 15:01:39 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
      Nov 9 15:01:42 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
      Nov 9 15:01:52 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
      Nov 9 15:01:55 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
      Nov 9 15:02:02 mesdb2 avahi-daemon[9460]: Withdrawing address record for 192.168.21.47 on eth2.
      Nov 9 15:02:02 mesdb2 avahi-daemon[9460]: Withdrawing address record for 192.168.21.7 on eth2.
      Nov 9 15:02:05 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
      Nov 9 15:02:07 mesdb2 avahi-daemon[9460]: Withdrawing address record for 169.254.8.102 on eth0.

      *##### alertmesdb2.log #####*
      2012-11-09 15:01:47.635
      [cssd(10044)]CRS-1612:Network communication with node mesdb1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.540 seconds
      2012-11-09 15:01:55.655
      [cssd(10044)]CRS-1611:Network communication with node mesdb1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.520 seconds
      2012-11-09 15:01:59.663
      [cssd(10044)]CRS-1610:Network communication with node mesdb1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.510 seconds
      2012-11-09 15:02:02.181
      [cssd(10044)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /oracle/grid/11.2/log/mesdb2/cssd/ocssd.log.
      2012-11-09 15:02:02.185
      [cssd(10044)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /oracle/grid/11.2/log/mesdb2/cssd/ocssd.log
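
A quick consistency check on the three CSS warnings above: each one reports how many seconds remain before the node is removed, so adding that figure to the warning's timestamp should give roughly the same projected removal time. The following is only a minimal Python sketch of that arithmetic. The helper name `projected_removal` and the shortened message strings are illustrative assumptions, not part of any Oracle tool.

```python
import re
from datetime import datetime, timedelta

# Timestamps and "seconds remaining" values copied from the alert log excerpt above.
# The message text is shortened here; only the trailing clause is needed.
WARNINGS = [
    ("2012-11-09 15:01:47.635", "CRS-1612: Removal of this node from cluster in 14.540 seconds"),
    ("2012-11-09 15:01:55.655", "CRS-1611: Removal of this node from cluster in 6.520 seconds"),
    ("2012-11-09 15:01:59.663", "CRS-1610: Removal of this node from cluster in 2.510 seconds"),
]

REMAINING_RE = re.compile(r"Removal of this node from cluster in ([\d.]+) seconds")


def projected_removal(timestamp, message):
    """Return the warning time plus the remaining seconds stated in the message."""
    logged_at = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
    remaining = float(REMAINING_RE.search(message).group(1))
    return logged_at + timedelta(seconds=remaining)


deadlines = [projected_removal(ts, msg) for ts, msg in WARNINGS]
for (ts, _), deadline in zip(WARNINGS, deadlines):
    # Trim microseconds to milliseconds to match the log's precision.
    print(ts, "->", deadline.strftime("%H:%M:%S.%f")[:-3])
```

All three warnings project to about 15:02:02.17, which lines up with the CRS-1609 message logged at 15:02:02.181. That suggests CSS counted down without interruption, i.e. heartbeats from mesdb1 never resumed during the brief periods when eth0 reported link up.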

      *##### ocssd.log ######*

      2012-11-09 15:00:58.288: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:00:58.288: [    CSSD][1080224064]clssnmSendingThread: sent 4 status msgs to all nodes
      2012-11-09 15:01:02.296: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:02.296: [    CSSD][1080224064]clssnmSendingThread: sent 4 status msgs to all nodes
      2012-11-09 15:01:07.306: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:07.306: [    CSSD][1080224064]clssnmSendingThread: sent 5 status msgs to all nodes
      2012-11-09 15:01:12.316: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:12.316: [    CSSD][1080224064]clssnmSendingThread: sent 5 status msgs to all nodes
      2012-11-09 15:01:17.326: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:17.326: [    CSSD][1080224064]clssnmSendingThread: sent 5 status msgs to all nodes
      2012-11-09 15:01:21.334: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:21.334: [    CSSD][1080224064]clssnmSendingThread: sent 4 status msgs to all nodes
      2012-11-09 15:01:25.342: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:25.342: [    CSSD][1080224064]clssnmSendingThread: sent 4 status msgs to all nodes
      2012-11-09 15:01:29.350: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:29.350: [    CSSD][1080224064]clssnmSendingThread: sent 4 status msgs to all nodes
      2012-11-09 15:01:34.360: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:34.360: [    CSSD][1080224064]clssnmSendingThread: sent 5 status msgs to all nodes
      2012-11-09 15:01:39.370: [    CSSD][1080224064]clssnmSendingThread: sending status msg to all nodes
      2012-11-09 15:01:39.370: [    CSSD][1080224064]clssnmSendingThread: sent 5 status msgs to all nodes
      2012-11-09 15:01:40.182: [GIPCHGEN][1098344768] gipchaInterfaceFail: marking interface failing 0x2aaab02c8980 { host '', haName 'CSS_scan', local (nil), ip '172.16.21.10', subnet '172.16.21.0', mask '255.255.255.0', numRef 1, numFail 0, flags 0x4d }
      2012-11-09 15:01:40.190: [GIPCHGEN][1096767808] gipchaInterfaceFail: marking interface failing 0x11fbea70 { host 'mesdb1', haName 'CSS_scan', local 0x2aaab02c8980, ip '172.16.21.9:56687', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 0, flags 0x6 }
      2012-11-09 15:01:40.200: [GIPCHGEN][1096767808] gipchaInterfaceDisable: disabling interface 0x2aaab02c8980 { host '', haName 'CSS_scan', local (nil), ip '172.16.21.10', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 1, flags 0x1cd }
      2012-11-09 15:01:40.212: [GIPCHGEN][1096767808] gipchaInterfaceDisable: disabling interface 0x11fbea70 { host 'mesdb1', haName 'CSS_scan', local 0x2aaab02c8980, ip '172.16.21.9:56687', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 0, flags 0x86 }
      2012-11-09 15:01:40.213: [GIPCHALO][1096767808] gipchaLowerCleanInterfaces: performing cleanup of disabled interface 0x11fbea70 { host 'mesdb1', haName 'CSS_scan', local 0x2aaab02c8980, ip '172.16.21.9:56687', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 0, flags 0xa6 }
      2012-11-09 15:01:40.213: [GIPCHGEN][1096767808] gipchaInterfaceReset: resetting interface 0x11fbea70 { host 'mesdb1', haName 'CSS_scan', local 0x2aaab02c8980, ip '172.16.21.9:56687', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 0, flags 0xa6 }
      2012-11-09 15:01:40.222: [GIPCHDEM][1096767808] gipchaWorkerCleanInterface: performing cleanup of disabled interface 0x2aaab02c8980 { host '', haName 'CSS_scan', local (nil), ip '172.16.21.10', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 0, flags 0x1ed }
      2012-11-09 15:01:40.222: [GIPCHTHR][1096767808] gipchaWorkerUpdateInterface: created remote interface for node 'mesdb1', haName 'CSS_scan', inf 'udp://172.16.21.9:56687'
      2012-11-09 15:01:40.222: [GIPCHALO][1096767808] gipchaLowerCleanInterfaces: forcing interface purge due to loss of all comms node 0x2aaab01fca50 { host 'mesdb1', haName 'CSS_scan', srcLuid 787fbb33-023e177a, dstLuid ec6439a0-44129130 numInf 1, contigSeq 1140862, lastAck 1140845, lastValidAck 1140862, sendSeq [1140846 : 1140855], createTime 16963874, flags 0x4808 }
      2012-11-09 15:01:40.222: [GIPCHGEN][1096767808] gipchaInterfaceDisable: disabling interface 0x11fbea70 { host 'mesdb1', haName 'CSS_scan', local (nil), ip '172.16.21.9:56687', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 0, flags 0x6 }
      2012-11-09 15:01:40.232: [GIPCHALO][1096767808] gipchaLowerCleanInterfaces: performing cleanup of disabled interface 0x11fbea70 { host 'mesdb1', haName 'CSS_scan', local (nil), ip '172.16.21.9:56687', subnet '172.16.21.0', mask '255.255.255.0', numRef 0, numFail 0, flags 0x226 }
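        • 1. Re: RAC node down, suspected network problem
          LiuMaclean(刘相兵)
          Please send me the alert.log and cssd.log from the node where the problem occurred.

          If you have questions, please open a thread on the OTN Chinese forum and I will reply. Address: http://www.otncn.org
          If you need to send attachments, you can email them directly to liu.maclean@gmail.com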
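          • 2. Re: RAC node down, suspected network problem
            958613
            As I recall, the HP NIC firmware version is rather old. At our company, a 580 G7 once caused node servers to hang because of this.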
            • 3. Re: RAC node down, suspected network problem
              947524
              I have already sent them to your Gmail as attachments.
              • 4. Re: RAC node down, suspected network problem
                LiuMaclean(刘相兵)
                The alert.log I meant is the alert.log of the instance under the RDBMS home.
                • 5. Re: RAC node down, suspected network problem
                  xifenfei
                  This looks like a hardware problem. Check whether the NIC driver needs to be upgraded.
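                  • 6. Re: RAC node down, suspected network problem
                    908705
                    I ran into a similar problem recently:

                    1. My environment uses the HP VC FlexFabric 10Gb/24-Port Module with NC551i NICs.
                    HP's official documentation recommends upgrading the VC firmware together with the NIC firmware and driver; otherwise momentary disconnections of about 1 minute can occur.

                    2. CSSD Fails to Join the Cluster After Private Network Recovered if avahi Daemon is up and Running [ID 1501093.1]

                    3. Your description says the host must be rebooted to resolve the problem?
                    The NICs known to have that fault are NetXen NICs (upgrading the microcode and driver fixes it); I have not heard of bnx NICs having this problem as well.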
                    • 7. Re: RAC node down, suspected network problem
                      947524
                      I have updated the attachments; please check your email.
                      • 8. Re: RAC node down, suspected network problem
                        LiuMaclean(刘相兵)
                        Fri Nov 09 15:02:04 2012
                        WARNING: Read Failed. group:1 disk:4 AU:1486343 offset:1032192 size:8192
                        WARNING: failed to read mirror side 1 of virtual extent 27357 logical extent 0 of file 1437 in group [1.3760756846] from disk MESDB_DISK01  allocation unit 1486343 reason error; if possible,will try another mirror side 
                        
                        
                        ...........................
                        
                        
                        WARNING: Read Failed. group:1 disk:4 AU:1486343 offset:1032192 size:8192
                        Instance terminated by ASMB, pid = 11110
                        Fri Nov 09 15:52:02 2012
                        Adjusting the default value of parameter parallel_max_servers
                        from 960 to 485 due to the value of parameter processes (500)
                        Starting ORACLE instance (normal)
                        LICENSE_MAX_SESSION = 0
                        LICENSE_SESSIONS_WARNING = 0
                        Private Interface 'eth0:1' configured from GPnP for use as a private interconnect.
                          [name='eth0:1', type=1, ip=169.254.8.102, mac=b4-99-ba-bd-2f-50, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
                        Public Interface 'eth2' configured from GPnP for use as a public interface.
                          [name='eth2', type=1, ip=192.168.21.10, mac=b4-99-ba-bd-2f-54, net=192.168.21.0/24, mask=255.255.255.0, use=public/1]
                        Public Interface 'eth2:1' configured from GPnP for use as a public interface.
                          [name='eth2:1', type=1, ip=192.168.21.47, mac=b4-99-ba-bd-2f-54, net=192.168.21.0/24, mask=255.255.255.0, use=public/1]
                        Picked latch-free SCN scheme 3
                        Starting at Nov 09 15:02:04 2012, node 2 could not read ASM disk MESDB_DISK01. The instance was terminated by ASMB and was started again at Nov 09 15:52:02 2012.

                        Private Interface 'eth0:1' 169.254.8.102
                        Nov  9 15:01:39 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:01:42 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:01:52 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:01:55 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:02:02 mesdb2 avahi-daemon[9460]: Withdrawing address record for 192.168.21.47 on eth2.
                        Nov  9 15:02:02 mesdb2 avahi-daemon[9460]: Withdrawing address record for 192.168.21.7 on eth2.
                        Nov  9 15:02:05 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:02:07 mesdb2 avahi-daemon[9460]: Withdrawing address record for 169.254.8.102 on eth0.
                        Nov  9 15:02:08 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:02:19 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:02:21 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:02:52 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:02:54 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:03:35 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:03:37 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:04:13 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:04:16 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:04:56 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:04:58 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:05:34 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:05:36 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:06:17 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:06:19 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:06:55 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:06:58 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:07:33 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:07:36 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:08:16 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:08:19 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:08:59 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:09:02 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:09:37 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:09:40 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:10:21 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:10:23 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:10:59 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:11:01 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:11:42 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:11:45 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:12:30 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:12:33 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:13:14 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:13:16 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:14:02 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:14:04 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:14:50 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:14:52 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:15:33 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:15:35 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:16:21 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:16:23 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:17:04 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:17:06 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:17:47 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:17:49 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:18:35 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:18:37 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:19:18 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:19:20 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:20:05 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:20:08 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:20:48 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:20:51 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:21:36 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:21:39 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:24:09 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:24:12 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:26:32 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:26:34 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:29:00 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:29:03 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:31:23 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
                        Nov  9 15:31:26 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
                        Nov  9 15:33:27 mesdb2 avahi-daemon[9460]: Withdrawing address record for 172.16.21.10 on eth0.
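                        eth0 on node 2 is flapping (the link goes down and comes back up intermittently). I suggest you ask the hardware vendor to diagnose the fault on this eth0 network interface.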
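
For anyone who wants to quantify the flapping before going to the vendor, the kernel messages can be paired up into Down/Up intervals. Below is a rough Python sketch under stated assumptions: the function name `link_outages` is hypothetical, the year is passed in because syslog lines do not carry one, and the regular expression only covers the bnx2 "NIC Copper Link" wording seen in this thread.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Nov  9 15:01:39 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down
LINK_RE = re.compile(
    r"^\s*(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+kernel:\s+"
    r"bnx2:\s+(\S+)\s+NIC Copper Link is (Up|Down)"
)
MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def link_outages(lines, iface="eth0", year=2012):
    """Return (down_time, up_time, seconds_down) for each Down -> Up pair."""
    outages = []
    down_at = None
    for line in lines:
        m = LINK_RE.match(line)
        if not m or m.group(6) != iface:
            continue  # skip avahi-daemon and other unrelated lines
        mon, day, hh, mm, ss, _, state = m.groups()
        ts = datetime(year, MONTHS[mon], int(day), int(hh), int(mm), int(ss))
        if state == "Down":
            down_at = ts
        elif down_at is not None:
            outages.append((down_at, ts, (ts - down_at).total_seconds()))
            down_at = None
    return outages


# A few lines copied from the messages excerpt above.
sample = [
    "Nov  9 15:01:39 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down",
    "Nov  9 15:01:42 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON",
    "Nov  9 15:01:52 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down",
    "Nov  9 15:01:55 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON",
    "Nov  9 15:02:02 mesdb2 avahi-daemon[9460]: Withdrawing address record for 192.168.21.47 on eth2.",
    "Nov  9 15:02:05 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Down",
    "Nov  9 15:02:08 mesdb2 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON",
]

outages = link_outages(sample)
for down, up, secs in outages:
    print(f"{down:%H:%M:%S} down, {up:%H:%M:%S} up, {secs:.0f}s without link")
```

On this small sample the link is lost three times for about 3 seconds each. To analyze a full log, pass the lines of the messages file to the same function, for example `link_outages(open(path))` with `path` pointing at your own copy of the log.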
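                        • 9. Re: RAC node down, suspected network problem
                          947524
                          Taking all the symptoms together, we have now narrowed the focus to the NIC, its driver, and its firmware.

                          I will post an update as soon as there is further progress.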