This discussion is archived
3 Replies Latest reply: Oct 14, 2013 3:17 AM by 1012438

Oracle 6.4 HA cluster problem

1012438 Newbie

I have a two-node cluster. Here are the configs of both nodes:

node 1:

<?xml version="1.0"?>
<cluster config_version="46" name="svfeOL-cluster">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="a-svfeOL" nodeid="1"/>
                <clusternode name="b-svfeOL" nodeid="2"/>
        </clusternodes>
        <rm log_facility="local4" log_level="5">
                <failoverdomains>
                        <failoverdomain name="svfeOL-cluster" nofailback="1" restricted="1">
                                <failoverdomainnode name="a-svfeOL"/>
                                <failoverdomainnode name="b-svfeOL"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/root/sv_run.sh" name="sv_run"/>
                        <ip address="10.10.60.4/24" sleeptime="2"/>
                </resources>
                <service domain="svfeOL-cluster" name="svfeOL" recovery="relocate">
                        <ip ref="10.10.60.4/24"/>
                        <script __failure_expire_time="120" __independent_subtree="1" __max_failures="5" __max_restarts="2" __restart_expire_time="120" ref="sv_run"/>
                </service>
        </rm>
        <totem token="20000"/>
        <logging debug="on"/>
        <dlm enable_plock="0" plock_ownership="0"/>
        <gfs_controld enable_plock="0" plock_ownership="0"/>
</cluster>


node 2:

<?xml version="1.0"?>
<cluster config_version="46" name="svfeOL-cluster">
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="a-svfeOL" nodeid="1"/>
                <clusternode name="b-svfeOL" nodeid="2"/>
        </clusternodes>
        <rm log_facility="local4" log_level="5">
                <failoverdomains>
                        <failoverdomain name="svfeOL-cluster" nofailback="1" restricted="1">
                                <failoverdomainnode name="a-svfeOL"/>
                                <failoverdomainnode name="b-svfeOL"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/root/sv_run.sh" name="sv_run"/>
                        <ip address="10.10.60.4/24" sleeptime="2"/>
                </resources>
                <service domain="svfeOL-cluster" name="svfeOL" recovery="relocate">
                        <ip ref="10.10.60.4/24"/>
                        <script __failure_expire_time="120" __independent_subtree="1" __max_failures="5" __max_restarts="2" __restart_expire_time="120" ref="sv_run"/>
                </service>
        </rm>
        <totem token="20000"/>
        <logging debug="on"/>
        <dlm enable_plock="0" plock_ownership="0"/>
        <gfs_controld enable_plock="0" plock_ownership="0"/>
</cluster>


Everything worked well. But twice now, on weekends, the cluster has stopped working, each time at exactly 05:03 in the morning.


Aug  4 05:03:34 a-svfeOL corosync[1572]:   [TOTEM ] A processor failed, forming new configuration.
Aug  4 05:03:36 a-svfeOL kernel: dlm: closing connection to node 2
Aug  4 05:03:36 a-svfeOL fenced[1716]: fencing node b-svfeOL
Aug  4 05:03:36 a-svfeOL fenced[1716]: fence b-svfeOL dev 0.0 agent none result: error no method
Aug  4 05:03:36 a-svfeOL fenced[1716]: fence b-svfeOL failed
Aug  4 05:03:38 a-svfeOL corosync[1572]:   [QUORUM] Members[1]: 1
Aug  4 05:03:38 a-svfeOL corosync[1572]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  4 05:03:38 a-svfeOL corosync[1572]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:2 left:1)
Aug  4 05:03:38 a-svfeOL corosync[1572]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  4 05:03:39 a-svfeOL fenced[1716]: fencing node b-svfeOL
Aug  4 05:03:39 a-svfeOL fenced[1716]: fence b-svfeOL dev 0.0 agent none result: error no method
Aug  4 05:03:39 a-svfeOL fenced[1716]: fence b-svfeOL failed
Aug  4 05:03:42 a-svfeOL fenced[1716]: fencing node b-svfeOL
Aug  4 05:03:42 a-svfeOL fenced[1716]: fence b-svfeOL dev 0.0 agent none result: error no method
Aug  4 05:03:42 a-svfeOL fenced[1716]: fence b-svfeOL failed
Aug  4 05:03:52 a-svfeOL corosync[1572]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  4 05:03:52 a-svfeOL corosync[1572]:   [QUORUM] Members[2]: 1 2
Aug  4 05:03:52 a-svfeOL corosync[1572]:   [QUORUM] Members[2]: 1 2
Aug  4 05:03:52 a-svfeOL corosync[1572]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:1 left:0)
Aug  4 05:03:52 a-svfeOL corosync[1572]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  4 05:03:55 a-svfeOL fenced[1716]: telling cman to remove nodeid 2 from cluster
Aug  4 05:04:15 a-svfeOL corosync[1572]:   [TOTEM ] A processor failed, forming new configuration.
Aug  4 05:04:17 a-svfeOL corosync[1572]:   [QUORUM] Members[1]: 1
Aug  4 05:04:17 a-svfeOL kernel: dlm: closing connection to node 2
Aug  4 05:04:17 a-svfeOL corosync[1572]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  4 05:04:17 a-svfeOL corosync[1572]:   [CPG   ] chosen downlist: sender r(0) ip(10.10.60.15) ; members(old:2 left:1)
Aug  4 05:04:17 a-svfeOL corosync[1572]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug  4 05:06:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
Aug  4 05:06:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 05:06:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
Aug  4 05:06:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
Aug  4 05:06:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
Aug  4 05:06:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180
Aug  4 05:06:01 a-svfeOL kernel: Call Trace:
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81062460>] ? try_to_wake_up+0x230/0x2b0
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81062510>] ? wake_up_state+0x10/0x20
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff810816c2>] ? signal_wake_up_state+0x22/0x40
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffffa02a0f9d>] dlm_user_request+0x4d/0x1c0 [dlm]
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff8115c3a6>] ? kmem_cache_alloc_trace+0x156/0x190
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffffa02ae841>] device_user_lock+0x131/0x140 [dlm]
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81081853>] ? set_current_blocked+0x53/0x70
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffffa02aeb32>] device_write+0x2e2/0x4f0 [dlm]
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff811722d8>] vfs_write+0xc8/0x190
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff811724a1>] sys_write+0x51/0x90
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81510ade>] ? do_device_not_available+0xe/0x10
Aug  4 05:06:01 a-svfeOL kernel: [<ffffffff81518442>] system_call_fastpath+0x16/0x1b
Aug  4 05:08:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
Aug  4 05:08:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 05:08:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
Aug  4 05:08:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
Aug  4 05:08:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
Aug  4 05:08:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180
Aug  4 05:08:01 a-svfeOL kernel: Call Trace:
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81062460>] ? try_to_wake_up+0x230/0x2b0
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81062510>] ? wake_up_state+0x10/0x20
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff810816c2>] ? signal_wake_up_state+0x22/0x40
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffffa02a0f9d>] dlm_user_request+0x4d/0x1c0 [dlm]
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff8115c3a6>] ? kmem_cache_alloc_trace+0x156/0x190
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffffa02ae841>] device_user_lock+0x131/0x140 [dlm]
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81081853>] ? set_current_blocked+0x53/0x70
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffffa02aeb32>] device_write+0x2e2/0x4f0 [dlm]
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff811722d8>] vfs_write+0xc8/0x190
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff811724a1>] sys_write+0x51/0x90
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81510ade>] ? do_device_not_available+0xe/0x10
Aug  4 05:08:01 a-svfeOL kernel: [<ffffffff81518442>] system_call_fastpath+0x16/0x1b
Aug  4 05:09:05 a-svfeOL auditd[1238]: Audit daemon rotating log files
Aug  4 05:10:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
Aug  4 05:10:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 05:10:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
Aug  4 05:10:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
Aug  4 05:10:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
Aug  4 05:10:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180
Aug  4 05:10:01 a-svfeOL kernel: Call Trace:
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81062460>] ? try_to_wake_up+0x230/0x2b0
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150dfaf>] schedule+0x3f/0x60
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150fe75>] rwsem_down_failed_common+0xc5/0x160
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81062510>] ? wake_up_state+0x10/0x20
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff810816c2>] ? signal_wake_up_state+0x22/0x40
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150ff45>] rwsem_down_read_failed+0x15/0x17
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81266584>] call_rwsem_down_read_failed+0x14/0x30
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8150f194>] ? down_read+0x24/0x30
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffffa02a0f9d>] dlm_user_request+0x4d/0x1c0 [dlm]
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff8115c3a6>] ? kmem_cache_alloc_trace+0x156/0x190
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffffa02ae841>] device_user_lock+0x131/0x140 [dlm]
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81081853>] ? set_current_blocked+0x53/0x70
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffffa02aeb32>] device_write+0x2e2/0x4f0 [dlm]
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff811722d8>] vfs_write+0xc8/0x190
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff811724a1>] sys_write+0x51/0x90
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81510ade>] ? do_device_not_available+0xe/0x10
Aug  4 05:10:01 a-svfeOL kernel: [<ffffffff81518442>] system_call_fastpath+0x16/0x1b
Aug  4 05:12:01 a-svfeOL kernel: INFO: task rgmanager:21716 blocked for more than 120 seconds.
Aug  4 05:12:01 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 05:12:01 a-svfeOL kernel: rgmanager       D ffff8801b097a728     0 21716   2177 0x00000080
Aug  4 05:12:01 a-svfeOL kernel: ffff8801b2753c60 0000000000000086 ffff8801bfd12180 ffff880100000001
Aug  4 05:12:01 a-svfeOL kernel: 0000000000012180 ffff8801b2753fd8 ffff8801b2752010 0000000000012180
Aug  4 05:12:01 a-svfeOL kernel: ffff8801b2753fd8 0000000000012180 ffff8801b4014480 ffff8801b097a180


b-node:

Aug  4 05:03:50 b-svfeOL corosync[1456]:   [TOTEM ] A processor failed, forming new configuration.
Aug  4 05:03:54 b-svfeOL corosync[1456]:   [QUORUM] Members[1]: 2
Aug  4 05:03:54 b-svfeOL corosync[1456]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug  4 05:03:54 b-svfeOL corosync[1456]:   [QUORUM] Members[2]: 1 2
Aug  4 05:03:54 b-svfeOL corosync[1456]:   [QUORUM] Members[2]: 1 2
Aug  4 05:03:54 b-svfeOL rsyslogd-2177: imuxsock begins to drop messages from pid 1456 due to rate-limiting
Aug  4 05:03:54 b-svfeOL kernel: dlm: closing connection to node 1
Aug  4 05:03:56 b-svfeOL rsyslogd-2177: imuxsock lost 218 messages from pid 1456 due to rate-limiting
Aug  4 05:03:57 b-svfeOL corosync[1456]: cman killed by node 1 because we were killed by cman_tool or other application
Aug  4 05:03:57 b-svfeOL fenced[1534]: telling cman to remove nodeid 1 from cluster
Aug  4 05:04:03 b-svfeOL gfs_controld[1602]: cluster is down, exiting
Aug  4 05:04:03 b-svfeOL gfs_controld[1602]: daemon cpg_dispatch error 2
Aug  4 05:04:03 b-svfeOL fenced[1534]: daemon cpg_dispatch error 2
Aug  4 05:04:03 b-svfeOL fenced[1534]: cluster is down, exiting
Aug  4 05:04:03 b-svfeOL fenced[1534]: daemon cpg_dispatch error 2
Aug  4 05:04:03 b-svfeOL fenced[1534]: cpg_dispatch error 2
Aug  4 05:04:03 b-svfeOL rgmanager[1996]: #67: Shutting down uncleanly
Aug  4 05:04:03 b-svfeOL dlm_controld[1558]: cman_get_cluster error -1 112
Aug  4 05:04:03 b-svfeOL dlm_controld[1558]: cluster is down, exiting
Aug  4 05:04:06 b-svfeOL kernel: dlm: closing connection to node 1
Aug  4 05:04:09 b-svfeOL kernel: dlm: closing connection to node 2
Aug  4 05:04:09 b-svfeOL kernel: dlm: rgmanager: no userland control daemon, stopping lockspace
Aug  4 05:04:09 b-svfeOL rgmanager[14800]: [script] Executing /root/sv_run.sh stop
Aug  4 05:04:30 b-svfeOL rgmanager[14885]: [ip] Removing IPv4 address 10.10.60.4/24 from eth4


From the logs it's clear that the node didn't receive a token; what I don't understand is why the service couldn't be relocated after that. It tried to fence node b, but no fence agent is configured.
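
The clusternode entries above have no <fence> sections and there is no <fencedevices> block, which is presumably why fenced logs "result: error no method". For illustration only (since these are VMs I'm assuming fence_xvm here; the device name is made up and the matching fence_virtd setup on the hypervisor is not shown), per-node fencing in cluster.conf would look roughly like this:

<clusternode name="a-svfeOL" nodeid="1">
        <fence>
                <method name="1">
                        <device name="virtfence" domain="a-svfeOL"/>
                </method>
        </fence>
</clusternode>
<clusternode name="b-svfeOL" nodeid="2">
        <fence>
                <method name="1">
                        <device name="virtfence" domain="b-svfeOL"/>
                </method>
        </fence>
</clusternode>

<fencedevices>
        <fencedevice agent="fence_xvm" name="virtfence"/>
</fencedevices>

After a change like that, config_version would need to be bumped and the config pushed to both nodes (for example with cman_tool version -r).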

  • 1. Re: Oracle 6.4 HA cluster problem
    Dude! Guru

    It seems a cluster node failed and the cluster formed a new configuration.


    What happens before 5 AM?


    Did the network fail?

    Power failure, kernel panic?

    Did something reboot the server?
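
    If it happens at exactly 05:03 each time, it is worth checking whether anything is scheduled around then, on the nodes or on the hypervisor. For example (standard EL6 log locations; adjust the date to the failure day):

    # did cron start anything in the failure window on either node?
    grep "Aug  4 05:0" /var/log/cron
    ls /etc/cron.d /etc/cron.daily /etc/cron.weekly

    # any NIC link up/down messages around the same time?
    grep -iE "link (is )?(up|down)" /var/log/messages

    Also check on the hypervisor whether a VM snapshot or backup job runs at that time; that can freeze a guest long enough for corosync to lose the token.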


  • 2. Re: Oracle 6.4 HA cluster problem
    1012438 Newbie

    Thanks for the reply.

    The thing is that, judging by the events, everything looks OK. In which logs would a network loss be visible? These are virtual machines.

    Why wasn't the service transferred? Nobody rebooted the machines. It's strange that both times it happened at 05:03 in the morning. Yes, I have seen the Oracle doc about this error.

  • 3. Re: Oracle 6.4 HA cluster problem
    1012438 Newbie

    Hello, I need help again! I updated the OS, and after the restart rgmanager is no longer working correctly.


    on b-node:

    Cluster Status for svfeOL-cluster @ Mon Oct 14 16:12:12 2013
    Member Status: Quorate

     Member Name                             ID   Status
     ------ ----                             ---- ------
     a-svfeOL                                    1 Online
     b-svfeOL                                    2 Online, Local, rgmanager

     Service Name                  Owner (Last)                  State
     ------- ----                  ----- ------                  -----
     service:svfeOL                b-svfeOL                      started

    on a-node:

    Cluster Status for svfeOL-cluster @ Mon Oct 14 16:12:41 2013
    Member Status: Quorate

     Member Name                             ID   Status
     ------ ----                             ---- ------
     a-svfeOL                                    1 Online, Local
     b-svfeOL                                    2 Online

    The service isn't being transferred :(


    Oct 14 13:29:37 a-svfeOL kernel: INFO: task rgmanager:2318 blocked for more than 120 seconds.
    Oct 14 13:29:37 a-svfeOL kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Oct 14 13:29:37 a-svfeOL kernel: rgmanager       D ffff8801b3be2b28     0  2318   2316 0x00000080
    Oct 14 13:29:37 a-svfeOL kernel: ffff8801b4219bf8 0000000000000086 ffff8801b4219c38 ffffffff81066ef3
    Oct 14 13:29:37 a-svfeOL kernel: 0000000000012180 ffff8801b4219fd8 ffff8801b4218010 0000000000012180
    Oct 14 13:29:37 a-svfeOL kernel: ffff8801b4219fd8 0000000000012180 ffff8801b16442c0 ffff8801b3be2580
    Oct 14 13:29:37 a-svfeOL kernel: Call Trace:
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81066ef3>] ? perf_event_task_sched_out+0x33/0xa0
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81509bef>] schedule+0x3f/0x60
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff8150a09d>] schedule_timeout+0x1fd/0x2e0
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff815094a6>] ? __schedule+0x3f6/0x810
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81041079>] ? default_spin_lock_flags+0x9/0x10
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81509a7a>] wait_for_common+0x11a/0x170
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff810623e0>] ? try_to_wake_up+0x2b0/0x2b0
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81509bad>] wait_for_completion+0x1d/0x20
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025438c>] new_lockspace+0x85c/0x8d0 [dlm]
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025445a>] dlm_new_lockspace+0x5a/0xe0 [dlm]
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff811fd48a>] ? security_capable+0x2a/0x30
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025bf2d>] device_create_lockspace+0x6d/0x150 [dlm]
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81081723>] ? set_current_blocked+0x53/0x70
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffffa025cb15>] device_write+0x2c5/0x4f0 [dlm]
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff8116cf18>] vfs_write+0xc8/0x190
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff8116d0e1>] sys_write+0x51/0x90
    Oct 14 13:29:37 a-svfeOL kernel: [<ffffffff81514082>] system_call_fastpath+0x16/0x1b
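
    The trace shows rgmanager stuck in dlm_new_lockspace, i.e. waiting while creating its DLM lockspace. For reference, the checks that seem relevant here would be roughly the following (standard cman/rgmanager commands shown as a sketch, not output from these nodes; service and member names taken from the clustat output above):

    service cman status; service rgmanager status
    cman_tool status        # membership and quorum as cman sees it
    cman_tool services      # fence domain and dlm lockspaces - anything stuck in a wait state?

    # once clustat shows the rgmanager flag on both members, a manual relocate would be:
    clusvcadm -r svfeOL -m a-svfeOL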
