
    Oracle 6.4 HA cluster problem

    1012438

      I have a cluster of two nodes. Their configs:

      node 1:

      <?xml version="1.0"?>
      <cluster config_version="46" name="svfeOL-cluster">
              <cman expected_votes="1" two_node="1"/>
              <clusternodes>
                      <clusternode name="a-svfeOL" nodeid="1"/>
                      <clusternode name="b-svfeOL" nodeid="2"/>
              </clusternodes>
              <rm log_facility="local4" log_level="5">
                      <failoverdomains>
                              <failoverdomain name="svfeOL-cluster" nofailback="1" restricted="1">
                                      <failoverdomainnode name="a-svfeOL"/>
                                      <failoverdomainnode name="b-svfeOL"/>
                              </failoverdomain>
                      </failoverdomains>
                      <resources>
                              <script file="/root/sv_run.sh" name="sv_run"/>
                              <ip address="10.10.60.4/24" sleeptime="2"/>
                      </resources>
                      <service domain="svfeOL-cluster" name="svfeOL" recovery="relocate">
                              <ip ref="10.10.60.4/24"/>
                              <script __failure_expire_time="120" __independent_subtree="1" __max_failures="5" __max_restarts="2" __restart_expire_time="120" ref="sv_run"/>
                      </service>
              </rm>
              <totem token="20000"/>
              <logging debug="on"/>
              <dlm enable_plock="0" plock_ownership="0"/>
              <gfs_controld enable_plock="0" plock_ownership="0"/>
      </cluster>
      

       

      node 2:

      <?xml version="1.0"?>
      <cluster config_version="46" name="svfeOL-cluster">
              <cman expected_votes="1" two_node="1"/>
              <clusternodes>
                      <clusternode name="a-svfeOL" nodeid="1"/>
                      <clusternode name="b-svfeOL" nodeid="2"/>
              </clusternodes>
              <rm log_facility="local4" log_level="5">
                      <failoverdomains>
                              <failoverdomain name="svfeOL-cluster" nofailback="1" restricted="1">
                                      <failoverdomainnode name="a-svfeOL"/>
                                      <failoverdomainnode name="b-svfeOL"/>
                              </failoverdomain>
                      </failoverdomains>
                      <resources>
                              <script file="/root/sv_run.sh" name="sv_run"/>
                              <ip address="10.10.60.4/24" sleeptime="2"/>
                      </resources>
                      <service domain="svfeOL-cluster" name="svfeOL" recovery="relocate">
                              <ip ref="10.10.60.4/24"/>
                              <script __failure_expire_time="120" __independent_subtree="1" __max_failures="5" __max_restarts="2" __restart_expire_time="120" ref="sv_run"/>
                      </service>
              </rm>
              <totem token="20000"/>
              <logging debug="on"/>
              <dlm enable_plock="0" plock_ownership="0"/>
              <gfs_controld enable_plock="0" plock_ownership="0"/>
      </cluster>
      

       

      Everything worked well. But twice now, both times on a weekend, the cluster has stopped working, each time at exactly 05:03 in the morning:

       

      Aug  4 05:03:34 a-svfeOL corosync[1572]:   [TOTEM ] A processor failed, forming new configuration.
      Aug  4 05:03:36 a-svfeOL kernel: dlm: closing connection to node 2
      Aug  4 05:03:36 a-svfeOL fenced[1716]: fencing node b-svfeOL
      Aug  4 05:03:36 a-svfeOL fenced[1716]: fence b-svfeOL dev 0.0 agent none result: error no method
      Aug  4 05:03:36 a-svfeOL fenced[1716]: fence b-svfeOL failed
      Aug  4 05:03:38 a-svfeOL corosync[1572]:   [QUORUM] Members[1]: 1
      Aug  4 05:03:38 a-svfeOL corosync[1572]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
      Aug  4 05:03:38 a-svfeOL corosync[1572]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:2 left:1)
      Aug  4 05:03:38 a-svfeOL corosync[1572]:   [MAIN  ] Completed service synchronization, ready to provide service.
      Aug  4 05:03:39 a-svfeOL fenced[1716]: fencing node b-svfeOL
      Aug  4 05:03:39 a-svfeOL fenced[1716]: fence b-svfeOL dev 0.0 agent none result: error no method
      Aug  4 05:03:39 a-svfeOL fenced[1716]: fence b-svfeOL failed
      Aug  4 05:03:42 a-svfeOL fenced[1716]: fencing node b-svfeOL
      Aug  4 05:03:42 a-svfeOL fenced[1716]: fence b-svfeOL dev 0.0 agent none result: error no method
      Aug  4 05:03:42 a-svfeOL fenced[1716]: fence b-svfeOL failed
      Aug  4 05:03:52 a-svfeOL corosync[1572]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
      Aug  4 05:03:52 a-svfeOL corosync[1572]:   [QUORUM] Members[2]: 1 2
      Aug  4 05:03:52 a-svfeOL corosync[1572]:   [QUORUM] Members[2]: 1 2
      Aug  4 05:03:52 a-svfeOL corosync[1572]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:1 left:0)
      Aug  4 05:03:52 a-svfeOL corosync[1572]:   [MAIN  ] Completed service synchronization, ready to provide service.
      Aug  4 05:03:55 a-svfeOL fenced[1716]: telling cman to remove nodeid 2 from cluster
      Aug  4 05:04:15 a-svfeOL corosync[1572]:   [TOTEM ] A processor failed, forming new configuration.
      Aug  4 05:04:17 a-svfeOL corosync[1572]:   [QUORUM] Members[1]: 1
      Aug  4 05:04:17 a-svfeOL kernel: dlm: closing connection to node 2
      Aug  4 05:04:17 a-svfeOL corosync[1572]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
      Aug  4 05:04:17 a-svfeOL corosync[1572]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:2 left:1)
      Aug  4 05:04:17 a-svfeOL corosync[1572]:   [MAIN  ] Completed service synchronization, ready to provide service.
      Aug  4 05:06:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
      Aug  4 05:06:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug  4 05:06:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
      Aug  4 05:06:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
      Aug  4 05:06:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
      Aug  4 05:06:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180
      Aug  4 05:06:01 a-svfeOL kernel: Call Trace:
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81062460>] ? try_to_wake_up+0x230/0x2b0
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81062510>] ? wake_up_state+0x10/0x20
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff810816c2>] ? signal_wake_up_state+0x22/0x40
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffffa02a0f9d>] dlm_user_request+0x4d/0x1c0 [dlm]
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8115c3a6>] ? kmem_cache_alloc_trace+0x156/0x190
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffffa02ae841>] device_user_lock+0x131/0x140 [dlm]
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81081853>] ? set_current_blocked+0x53/0x70
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffffa02aeb32>] device_write+0x2e2/0x4f0 [dlm]
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff811722d8>] vfs_write+0xc8/0x190
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff811724a1>] sys_write+0x51/0x90
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81510ade>] ? do_device_not_available+0xe/0x10
      Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81518442>] system_call_fastpath+0x16/0x1b
      Aug  4 05:08:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
      Aug  4 05:08:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug  4 05:08:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
      Aug  4 05:08:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
      Aug  4 05:08:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
      Aug  4 05:08:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180
      Aug  4 05:08:01 a-svfeOL kernel: Call Trace:
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81062460>] ? try_to_wake_up+0x230/0x2b0
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81062510>] ? wake_up_state+0x10/0x20
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff810816c2>] ? signal_wake_up_state+0x22/0x40
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffffa02a0f9d>] dlm_user_request+0x4d/0x1c0 [dlm]
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8115c3a6>] ? kmem_cache_alloc_trace+0x156/0x190
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffffa02ae841>] device_user_lock+0x131/0x140 [dlm]
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81081853>] ? set_current_blocked+0x53/0x70
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffffa02aeb32>] device_write+0x2e2/0x4f0 [dlm]
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff811722d8>] vfs_write+0xc8/0x190
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff811724a1>] sys_write+0x51/0x90
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81510ade>] ? do_device_not_available+0xe/0x10
      Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81518442>] system_call_fastpath+0x16/0x1b
      Aug  4 05:09:05 a-svfeOL auditd[1238]: Audit daemon rotating log files
      Aug  4 05:10:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
      Aug  4 05:10:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug  4 05:10:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
      Aug  4 05:10:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
      Aug  4 05:10:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
      Aug  4 05:10:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180
      Aug  4 05:10:01 a-svfeOL kernel: Call Trace:
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81062460>] ? try_to_wake_up+0x230/0x2b0
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81062510>] ? wake_up_state+0x10/0x20
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff810816c2>] ? signal_wake_up_state+0x22/0x40
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffffa02a0f9d>] dlm_user_request+0x4d/0x1c0 [dlm]
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8115c3a6>] ? kmem_cache_alloc_trace+0x156/0x190
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffffa02ae841>] device_user_lock+0x131/0x140 [dlm]
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81081853>] ? set_current_blocked+0x53/0x70
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffffa02aeb32>] device_write+0x2e2/0x4f0 [dlm]
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff811722d8>] vfs_write+0xc8/0x190
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff811724a1>] sys_write+0x51/0x90
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81510ade>] ? do_device_not_available+0xe/0x10
      Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81518442>] system_call_fastpath+0x16/0x1b
      Aug  4 05:12:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
      Aug  4 05:12:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Aug  4 05:12:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
      Aug  4 05:12:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
      Aug  4 05:12:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
      Aug  4 05:12:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180
      

       

      b-node:

      Aug  4 05:03:50 b-svfeOL corosync[1456]:   [TOTEM ] A processor failed, forming new configuration.
      Aug  4 05:03:54 b-svfeOL corosync[1456]:   [QUORUM] Members[1]: 2
      Aug  4 05:03:54 b-svfeOL corosync[1456]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
      Aug  4 05:03:54 b-svfeOL corosync[1456]:   [QUORUM] Members[2]: 1 2
      Aug  4 05:03:54 b-svfeOL corosync[1456]:   [QUORUM] Members[2]: 1 2
      Aug  4 05:03:54 b-svfeOL rsyslogd-2177: imuxsock begins to drop messages from pid 1456 due to rate-limiting
      Aug  4 05:03:54 b-svfeOL kernel: dlm: closing connection to node 1
      Aug  4 05:03:56 b-svfeOL rsyslogd-2177: imuxsock lost 218 messages from pid 1456 due to rate-limiting
      Aug  4 05:03:57 b-svfeOL corosync[1456]: cman killed by node 1 because we were killed by cman_tool or other application
      Aug  4 05:03:57 b-svfeOL fenced[1534]: telling cman to remove nodeid 1 from cluster
      Aug  4 05:04:03 b-svfeOL gfs_controld[1602]: cluster is down, exiting
      Aug  4 05:04:03 b-svfeOL gfs_controld[1602]: daemon cpg_dispatch error 2
      Aug  4 05:04:03 b-svfeOL fenced[1534]: daemon cpg_dispatch error 2
      Aug  4 05:04:03 b-svfeOL fenced[1534]: cluster is down, exiting
      Aug  4 05:04:03 b-svfeOL fenced[1534]: daemon cpg_dispatch error 2
      Aug  4 05:04:03 b-svfeOL fenced[1534]: cpg_dispatch error 2
      Aug  4 05:04:03 b-svfeOL rgmanager[1996]: #67: Shutting down uncleanly
      Aug  4 05:04:03 b-svfeOL dlm_controld[1558]: cman_get_cluster error -1 112
      Aug  4 05:04:03 b-svfeOL dlm_controld[1558]: cluster is down, exiting
      Aug  4 05:04:06 b-svfeOL kernel: dlm: closing connection to node 1
      Aug  4 05:04:09 b-svfeOL kernel: dlm: closing connection to node 2
      Aug  4 05:04:09 b-svfeOL kernel: dlm: rgmanager: no userland control daemon, stopping lockspace
      Aug  4 05:04:09 b-svfeOL rgmanager[14800]: [script] Executing /root/sv_run.sh stop
      Aug  4 05:04:30 b-svfeOL rgmanager[14885]: [ip] Removing IPv4 address 10.10.60.4/24 from eth4
      

       

      From the logs it is clear that a node did not receive the totem token; what I don't understand is why the cluster couldn't relocate the service after that. It tried to fence, but the fence agent isn't configured.
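
      The "fence b-svfeOL dev 0.0 agent none result: error no method" lines above are the direct symptom of that: cluster.conf has no <fence> method under either <clusternode> and no <fencedevices> section, so fenced has nothing to run, and recovery stalls until fencing succeeds, which matches the blocked rgmanager traces. Purely as a sketch of the shape such a section takes (the agent choice, device names, addresses and credentials below are all placeholders; for virtual machines fence_xvm/fence_virt would be a more natural agent than IPMI):

              <clusternodes>
                      <clusternode name="a-svfeOL" nodeid="1">
                              <fence>
                                      <method name="1">
                                              <device name="ipmi-a"/>
                                      </method>
                              </fence>
                      </clusternode>
                      <clusternode name="b-svfeOL" nodeid="2">
                              <fence>
                                      <method name="1">
                                              <device name="ipmi-b"/>
                                      </method>
                              </fence>
                      </clusternode>
              </clusternodes>
              <fencedevices>
                      <!-- placeholder addresses and credentials -->
                      <fencedevice agent="fence_ipmilan" name="ipmi-a" ipaddr="192.0.2.1" login="admin" passwd="secret"/>
                      <fencedevice agent="fence_ipmilan" name="ipmi-b" ipaddr="192.0.2.2" login="admin" passwd="secret"/>
              </fencedevices>

      After editing, increment config_version, validate with ccs_config_validate, propagate with cman_tool version -r, and test with fence_node <node> (note that this really power-cycles the node).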

        • 1. Re: Oracle 6.4 HA cluster problem
          Dude!

          It seems a cluster node failed and the cluster formed a new configuration.

           

          What happens before 5 AM?

           

          Did the network fail?

          Power failure, kernel panic?

          Did something reboot the server?

           

          • 2. Re: Oracle 6.4 HA cluster problem
            1012438

            Thanks for the reply.

            The thing is, judging by the events everything looks OK. In which logs could I see a network loss? These are virtual machines.

            Why wasn't the service relocated? Nobody rebooted the machine. It is strange that both times it happened at 05:03 in the morning. Yes, I have seen the Oracle document about this error.
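
            One detail worth checking, given that these are virtual machines: a host-side job scheduled at a fixed time (snapshot, backup, live migration) can pause a guest long enough for corosync to miss the totem token even with token="20000", which would fit the exact 05:03 pattern. If that turns out to be the cause, a stopgap sketch is simply a larger token timeout (60000 ms is an arbitrary example value):

                    <!-- example only: raise the token timeout from 20 s to 60 s -->
                    <totem token="60000"/>

            As usual, increment config_version and propagate with cman_tool version -r; the real fix would be to move or stagger whatever the host runs at 05:03.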

            • 3. Re: Oracle 6.4 HA cluster problem
              1012438

              Hello, I need help again! I updated the OS, and after the restart rgmanager stopped working correctly.

               

              on b-node

              Cluster Status for svfeOL-cluster @ Mon Oct 14 16:12:12 2013
              Member Status: Quorate

               Member Name                     ID   Status
               ------ ----                     ---- ------
               a-svfeOL                           1 Online
               b-svfeOL                           2 Online, Local, rgmanager

               Service Name                    Owner (Last)             State
               ------- ----                    ----- ------             -----
               service:svfeOL                  b-svfeOL                 started

               

              on a-node

              Cluster Status for svfeOL-cluster @ Mon Oct 14 16:12:41 2013
              Member Status: Quorate

               Member Name                     ID   Status
               ------ ----                     ---- ------
               a-svfeOL                           1 Online, Local
               b-svfeOL                           2 Online

               

              The service isn't being relocated ((

               

              Oct 14 13:29:37 a-svfeOL kernel: INFO: task rgmanager:2318 blocked for more than 120 seconds.
              Oct 14 13:29:37 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
              Oct 14 13:29:37 a-svfeOL kernel: rgmanager       D ffff8801b3be2b28     0  2318   2316 0x00000080
              Oct 14 13:29:37 a-svfeOL kernel: ffff8801b4219bf8 0000000000000086 ffff8801b4219c38 ffffffff81066ef3
              Oct 14 13:29:37 a-svfeOL kernel: 0000000000012180 ffff8801b4219fd8 ffff8801b4218010 0000000000012180
              Oct 14 13:29:37 a-svfeOL kernel: ffff8801b4219fd8 0000000000012180 ffff8801b16442c0 ffff8801b3be2580
              Oct 14 13:29:37 a-svfeOL kernel: Call Trace:
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81066ef3>] ? perf_event_task_sched_out+0x33/0xa0
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81509bef>] schedule+0x3f/0x60
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff8150a09d>] schedule_timeout+0x1fd/0x2e0
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff815094a6>] ? __schedule+0x3f6/0x810
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81041079>] ? default_spin_lock_flags+0x9/0x10
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81509a7a>] wait_for_common+0x11a/0x170
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff810623e0>] ? try_to_wake_up+0x2b0/0x2b0
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81509bad>] wait_for_completion+0x1d/0x20
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025438c>] new_lockspace+0x85c/0x8d0 [dlm]
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025445a>] dlm_new_lockspace+0x5a/0xe0 [dlm]
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff811fd48a>] ? security_capable+0x2a/0x30
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025bf2d>] device_create_lockspace+0x6d/0x150 [dlm]
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81081723>] ? set_current_blocked+0x53/0x70
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025cb15>] device_write+0x2c5/0x4f0 [dlm]
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff8116cf18>] vfs_write+0xc8/0x190
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff8116d0e1>] sys_write+0x51/0x90
              Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81514082>] system_call_fastpath+0x16/0x1b