8 Replies Latest reply: Dec 22, 2009 1:13 PM by Avi Miller-Oracle

    Full server pool crashes when adding new iSCSI server

    733736
      Hi,

      we have a server pool with 2 machines (one acting as Server Pool Master, Utility Server and Virtual Machine Server, and another one acting only as Virtual Machine Server). Both share an iSCSI disk which provides the /OVS partition.

      This is working: we can use High Availability, migrate guests, etc.

      But when adding new Virtual Machine Servers to the pool (with guests running), the machines already in the pool get restarted.

      My question is: can Virtual Machine Servers be "hot added" to the server pool while guests are running?

      Thanks and regards,
      Marc
        • 1. Re: Full server pool crashes when adding new iSCSI server
          Avi Miller-Oracle
          Marc Caubet wrote:
          But when adding new Virtual Machine Servers to the pool (with guests running), the machines already in the pool get restarted.
          My question is: can Virtual Machine Servers be "hot added" to the server pool while guests are running?
          Absolutely. It sounds like something in your network may not be properly configured, which is causing OCFS2 to fence and reboot the nodes. Can you double-check the /etc/hosts files across your pool members and ensure that every pool member can resolve the FQDN of every other pool member, including itself? This is crucial for correct OCFS2 configuration. Also, check /var/log/messages around the time of the reboot to see whether there are any warnings/errors from OCFS2.
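          As a quick sanity check, something like the sketch below run on every server would do (the host list is illustrative; substitute your actual pool members):

          # run on each pool member; every FQDN, including the server's own, must resolve
          for h in vmserver10.pic.es vmserver15.pic.es; do
              getent hosts "$h" || echo "cannot resolve: $h"
          done

          Each name should resolve to the IP address you expect OCFS2 to use for cluster traffic.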
          • 2. Re: Full server pool crashes when adding new iSCSI server
            733736
            Hi,

            The hosts file seems to be correct.

            The messages logs around the crash time are the following:

            Node vmserver15 = Server Pool Master, Utility Server and Virtual Machine Server

            Dec 10 12:40:02 vmserver15 kernel: vlan500: port 3(vif6.0) entering disabled state
            Dec 10 12:40:02 vmserver15 kernel: device vif6.0 left promiscuous mode
            Dec 10 12:40:02 vmserver15 kernel: type=1700 audit(1260445202.434:16): dev=vif6.0 prom=0 old_prom=256 auid=4294967295 ses=4294967295
            Dec 10 12:40:02 vmserver15 kernel: vlan500: port 3(vif6.0) entering disabled state
            Dec 10 12:40:02 vmserver15 kernel: loop10: dropped 10114 extents
            Dec 10 12:40:03 vmserver15 udhcpc: udhcp client (v0.9.8) started
            Dec 10 12:40:03 vmserver15 udhcpc: Lease of 193.109.175.25 obtained, lease time 172800
            Dec 10 12:40:04 vmserver15 kernel: device vif7.0 entered promiscuous mode
            Dec 10 12:40:04 vmserver15 kernel: type=1700 audit(1260445204.774:17): dev=vif7.0 prom=256 old_prom=0 auid=4294967295 ses=4294967295
            Dec 10 12:40:04 vmserver15 kernel: vlan500: topology change detected, propagating
            Dec 10 12:40:04 vmserver15 kernel: vlan500: port 3(vif7.0) entering forwarding state
            Dec 10 12:40:05 vmserver15 kernel: loop10: fast redirect
            Dec 10 12:40:06 vmserver15 kernel: blkback: ring-ref 770, event-channel 9, protocol 1 (x86_32-abi)
            Dec 10 12:53:35 vmserver15 kernel: o2net: no longer connected to node vmserver10.pic.es (num 1) at 193.109.174.110:7777
            Dec 10 12:53:36 vmserver15 kernel: (4989,0):o2hb_do_disk_heartbeat:776 ERROR: Device "sdb1": another node is heartbeating in our slot!
            Dec 10 12:53:37 vmserver15 kernel: o2net: accepted connection from node vmserver10.pic.es (num 1) at 193.109.174.110:7777
            Dec 10 12:53:38 vmserver15 kernel: (4989,0):o2hb_do_disk_heartbeat:776 ERROR: Device "sdb1": another node is heartbeating in our slot!
            Dec 10 12:53:50 vmserver15 last message repeated 6 times
            Dec 10 12:53:51 vmserver15 kernel: o2net: no longer connected to node vmserver10.pic.es (num 1) at 193.109.174.110:7777
            Dec 10 12:53:52 vmserver15 kernel: (4989,0):o2hb_do_disk_heartbeat:776 ERROR: Device "sdb1": another node is heartbeating in our slot!
            Dec 10 12:53:53 vmserver15 kernel: o2net: accepted connection from node vmserver10.pic.es (num 1) at 193.109.174.110:7777
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):dlm_send_remote_convert_request:393 ERROR: dlm status = DLM_IVLOCKID
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):ocfs2_cluster_lock:1206 ERROR: DLM error DLM_IVLOCKID while calling dlmlock on resource M000000000000000001050c00000000: bad lockid
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):ocfs2_inode_lock_full:2064 ERROR: status = -22
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):ocfs2_inode_lock_atime:2193 ERROR: status = -22
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):__ocfs2_file_aio_read:2434 ERROR: status = -22
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):dlm_send_remote_convert_request:393 ERROR: dlm status = DLM_IVLOCKID
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):ocfs2_cluster_lock:1206 ERROR: DLM error DLM_IVLOCKID while calling dlmlock on resource M000000000000000001050c00000000: bad lockid
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):ocfs2_inode_lock_full:2064 ERROR: status = -22
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):ocfs2_write_begin:1845 ERROR: status = -22
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):ocfs2_file_buffered_write:2016 ERROR: status = -22
            Dec 10 12:53:53 vmserver15 kernel: (5638,0):__ocfs2_file_aio_write:2173 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):dlm_send_remote_convert_request:393 ERROR: dlm status = DLM_IVLOCKID
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_cluster_lock:1206 ERROR: DLM error DLM_IVLOCKID while calling dlmlock on resource M000000000000000000020744c1370e: bad lockid
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_inode_lock_full:2064 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_reserve_suballoc_bits:449 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_reserve_cluster_bitmap_bits:682 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_local_alloc_reserve_for_window:930 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_local_alloc_slide_window:1063 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):__ocfs2_reserve_clusters:725 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_lock_allocators:677 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_write_begin_nolock:1751 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_write_begin:1861 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):ocfs2_file_buffered_write:2016 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: (7923,0):__ocfs2_file_aio_write:2173 ERROR: status = -22
            Dec 10 12:53:58 vmserver15 kernel: loop: Write error at byte offset 37644512256, length 4096.
            [... the above block of errors is repeated a few times ...]
            Dec 10 12:58:29 vmserver15 kernel: (5638,3):dlmconvert_remote:327 ERROR: dlm status = DLM_IVLOCKID
            Dec 10 12:58:29 vmserver15 kernel: (5638,3):ocfs2_cluster_lock:1206 ERROR: DLM error DLM_IVLOCKID while calling dlmlock on resource M000000000000000001050c00000000: bad lockid
            Dec 10 12:58:29 vmserver15 kernel: (5638,3):ocfs2_inode_lock_full:2064 ERROR: status = -22
            Dec 10 12:58:29 vmserver15 kernel: (5638,3):ocfs2_write_begin:1845 ERROR: status = -22
            Dec 10 12:58:29 vmserver15 kernel: (5638,3):ocfs2_file_buffered_write:2016 ERROR: status = -22
            Dec 10 12:58:29 vmserver15 kernel: (5638,3):__ocfs2_file_aio_write:2173 ERROR: status = -22
            Dec 10 13:01:16 vmserver15 syslogd 1.4.1: restart.

            Node vmserver10 = Virtual Machine Server

            Dec 10 12:53:35 vmserver10 kernel: o2net: no longer connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 12:53:35 vmserver10 kernel: (5029,0):ocfs2_dlm_eviction_cb:98 device (8,17): dlm has evicted node 0
            Dec 10 12:53:35 vmserver10 kernel: (20996,0):dlm_get_lock_resource:844 E3FE9E5767CA457FA697980EB637E93B:M000000000000000000022044c1370e: at least one node (0) to recover before lock mastery can begin
            Dec 10 12:53:36 vmserver10 kernel: (5344,4):dlm_get_lock_resource:844 E3FE9E5767CA457FA697980EB637E93B:$RECOVERY: at least one node (0) to recover before lock mastery can begin
            Dec 10 12:53:36 vmserver10 kernel: (5344,4):dlm_get_lock_resource:878 E3FE9E5767CA457FA697980EB637E93B: recovery map is not empty, but must master $RECOVERY lock now
            Dec 10 12:53:36 vmserver10 kernel: (5344,4):dlm_do_recovery:524 (5344) Node 1 is the Recovery Master for the Dead Node 0 for Domain E3FE9E5767CA457FA697980EB637E93B
            Dec 10 12:53:36 vmserver10 kernel: (20996,0):ocfs2_replay_journal:1183 Recovering node 0 from slot 0 on device (8,17)
            Dec 10 12:53:37 vmserver10 kernel: o2net: connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 12:53:38 vmserver10 kernel: (8672,1):dlm_get_lock_resource:844 ovm:$RECOVERY: at least one node (0) to recover before lock mastery can begin
            Dec 10 12:53:38 vmserver10 kernel: (8672,1):dlm_get_lock_resource:878 ovm: recovery map is not empty, but must master $RECOVERY lock now
            Dec 10 12:53:38 vmserver10 kernel: (8672,1):dlm_do_recovery:524 (8672) Node 1 is the Recovery Master for the Dead Node 0 for Domain ovm
            Dec 10 12:53:40 vmserver10 kernel: kjournald starting. Commit interval 5 seconds
            Dec 10 12:53:51 vmserver10 kernel: o2net: no longer connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 12:53:53 vmserver10 kernel: o2net: connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 12:53:53 vmserver10 kernel: (3761,0):dlm_convert_lock_handler:489 ERROR: did not find lock to convert on grant queue! cookie=0:92
            Dec 10 12:53:53 vmserver10 kernel: lockres: M000000000000000001050c0000000, owner=1, state=0
            Dec 10 12:53:53 vmserver10 kernel:   last used: 0, refcnt: 3, on purge list: no
            Dec 10 12:53:53 vmserver10 kernel:   on dirty list: no, on reco list: no, migrating pending: no
            Dec 10 12:53:53 vmserver10 kernel:   inflight locks: 0, asts reserved: 0
            Dec 10 12:53:53 vmserver10 kernel:   refmap nodes: [ ], inflight=0
            Dec 10 12:53:53 vmserver10 kernel:   granted queue:
            Dec 10 12:53:53 vmserver10 kernel:     type=5, conv=-1, node=1, cookie=1:243, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
            Dec 10 12:53:53 vmserver10 kernel:   converting queue:
            Dec 10 12:53:53 vmserver10 kernel:   blocked queue:
            [... the above block is repeated a few times ...]
            Dec 10 12:57:18 vmserver10 modprobe: FATAL: Module ocfs2_stackglue not found.
            Dec 10 12:57:18 vmserver10 kernel: (3761,0):dlm_convert_lock_handler:489 ERROR: did not find lock to convert on grant queue! cookie=0:92
            Dec 10 12:57:18 vmserver10 kernel: lockres: M000000000000000001050c0000000, owner=1, state=0
            Dec 10 12:57:18 vmserver10 kernel:   last used: 0, refcnt: 3, on purge list: no
            Dec 10 12:57:18 vmserver10 kernel:   on dirty list: no, on reco list: no, migrating pending: no
            Dec 10 12:57:18 vmserver10 kernel:   inflight locks: 0, asts reserved: 0
            Dec 10 12:57:18 vmserver10 kernel:   refmap nodes: [ ], inflight=0
            Dec 10 12:57:18 vmserver10 kernel:   granted queue:
            Dec 10 12:57:18 vmserver10 kernel:     type=5, conv=-1, node=1, cookie=1:243, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
            Dec 10 12:57:18 vmserver10 kernel:   converting queue:
            Dec 10 12:57:18 vmserver10 kernel:   blocked queue:
            [... the above block is repeated a few times ...]
            Dec 10 12:58:32 vmserver10 kernel: (3761,0):dlm_unlock_lock_handler:511 ERROR: failed to find lock to unlock! cookie=0:1849
            Dec 10 12:58:33 vmserver10 modprobe: FATAL: Module ocfs2_stackglue not found.
            Dec 10 12:59:02 vmserver10 kernel: o2net: connection to node vmserver15.pic.es (num 0) at 193.109.174.115:7777 has been idle for 30.0 seconds, shutting it down.
            Dec 10 12:59:02 vmserver10 kernel: (0,0):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1260446312.830107 now 1260446342.828243 dr 1260446312.830066 adv 1260446312.830319:1260446312.830320 func (b9f5fd13:506) 1260446312.830109:1260446312.830303)
            Dec 10 12:59:02 vmserver10 kernel: o2net: no longer connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 12:59:32 vmserver10 kernel: (3761,0):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
            Dec 10 13:01:42 vmserver10 kernel: o2net: connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 13:01:45 vmserver10 kernel: ocfs2_dlm: Node 0 joins domain E3FE9E5767CA457FA697980EB637E93B
            Dec 10 13:01:45 vmserver10 kernel: ocfs2_dlm: Nodes in domain ("E3FE9E5767CA457FA697980EB637E93B"): 0 1
            Dec 10 13:01:51 vmserver10 kernel: o2net: accepted connection from node vmserver16.pic.es (num 2) at 193.109.174.116:7777
            Dec 10 13:01:56 vmserver10 kernel: ocfs2_dlm: Node 2 joins domain E3FE9E5767CA457FA697980EB637E93B
            Dec 10 13:01:56 vmserver10 kernel: ocfs2_dlm: Nodes in domain ("E3FE9E5767CA457FA697980EB637E93B"): 0 1 2
            Dec 10 13:09:05 vmserver10 modprobe: FATAL: Module ocfs2_stackglue not found.
            Dec 10 13:16:45 vmserver10 kernel: o2net: connection to node vmserver16.pic.es (num 2) at 193.109.174.116:7777 has been idle for 30.0 seconds, shutting it down.
            Dec 10 13:16:45 vmserver10 kernel: (0,0):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1260447375.655426 now 1260447405.655712 dr 1260447375.655413 adv 1260447375.655427:1260447375.655427 func (b9f5fd13:503) 1260446516.75600:1260446516.75608)
            Dec 10 13:16:45 vmserver10 kernel: o2net: no longer connected to node vmserver16.pic.es (num 2) at 193.109.174.116:7777
            Dec 10 13:17:15 vmserver10 kernel: (3761,0):o2net_connect_expired:1664 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
            Dec 10 13:17:19 vmserver10 kernel: (5029,0):ocfs2_dlm_eviction_cb:98 device (8,17): dlm has evicted node 2
            Dec 10 13:17:20 vmserver10 kernel: (3761,0):ocfs2_dlm_eviction_cb:98 device (8,17): dlm has evicted node 2
            Dec 10 13:29:05 vmserver10 kernel: o2net: no longer connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 13:29:05 vmserver10 kernel: (5029,0):ocfs2_dlm_eviction_cb:98 device (8,17): dlm has evicted node 0
            Dec 10 13:29:06 vmserver10 kernel: (5344,4):dlm_get_lock_resource:844 E3FE9E5767CA457FA697980EB637E93B:$RECOVERY: at least one node (0) to recover before lock mastery can begin
            Dec 10 13:29:06 vmserver10 kernel: (5344,4):dlm_get_lock_resource:878 E3FE9E5767CA457FA697980EB637E93B: recovery map is not empty, but must master $RECOVERY lock now
            Dec 10 13:29:06 vmserver10 kernel: (5344,4):dlm_do_recovery:524 (5344) Node 1 is the Recovery Master for the Dead Node 0 for Domain E3FE9E5767CA457FA697980EB637E93B
            Dec 10 13:29:06 vmserver10 kernel: (28412,0):ocfs2_replay_journal:1183 Recovering node 0 from slot 0 on device (8,17)
            Dec 10 13:29:09 vmserver10 kernel: kjournald starting. Commit interval 5 seconds
            Dec 10 13:29:16 vmserver10 kernel: o2net: accepted connection from node vmserver16.pic.es (num 2) at 193.109.174.116:7777
            Dec 10 13:29:20 vmserver10 kernel: ocfs2_dlm: Node 2 joins domain E3FE9E5767CA457FA697980EB637E93B
            Dec 10 13:29:20 vmserver10 kernel: ocfs2_dlm: Nodes in domain ("E3FE9E5767CA457FA697980EB637E93B"): 1 2
            Dec 10 13:32:08 vmserver10 kernel: o2net: connected to node vmserver15.pic.es (num 0) at 193.109.174.115:7777
            Dec 10 13:32:11 vmserver10 kernel: ocfs2_dlm: Node 0 joins domain E3FE9E5767CA457FA697980EB637E93B
            Dec 10 13:32:11 vmserver10 kernel: ocfs2_dlm: Nodes in domain ("E3FE9E5767CA457FA697980EB637E93B"): 0 1 2
            Dec 10 13:36:10 vmserver10 shutdown[28681]: shutting down for system reboot

            I will investigate what seems to be going on and post it here.

            Thanks for your help.

            • 3. Re: Full server pool crashes when adding new iSCSI server
              Avi Miller-Oracle
              Marc Caubet wrote:
              I will investigate what seems to be going on and post it here.
              OCFS2 is evicting nodes, possibly because the network is overwhelmed by both the DLM and iSCSI traffic. Are you running these across the same network interfaces? If so, you should probably look at splitting the network traffic for OCFS2 from the iSCSI traffic. Make sure that, for each FQDN, the /etc/hosts file references the IP address you want OCFS2 to communicate over, as this is what Oracle VM uses to auto-configure OCFS2 during server registration. iSCSI doesn't have this requirement and should run across a different NIC/network segment.
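              As a rough sketch (the 10.0.0.x subnet below is purely illustrative), the /etc/hosts entries on every pool member would then point the FQDNs at the dedicated OCFS2 interconnect, while the iSCSI portal stays on the storage network and is reached through the initiator configuration rather than /etc/hosts:

              # illustrative only: FQDNs resolve to a dedicated OCFS2/o2net network
              10.0.0.15    vmserver15.pic.es vmserver15
              10.0.0.10    vmserver10.pic.es vmserver10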

              There is also an issue with nodes heartbeating in the wrong slot: check /etc/ocfs2/cluster.conf on all your pool members. It should be the same on all servers. If not, one of the servers is out of sync. You can use the "Restore" option in Oracle VM Manager to propagate the configuration for the pool back down to all pool members. This should rectify the issue and allow you to add the third server successfully.
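              For reference, a consistent /etc/ocfs2/cluster.conf looks roughly like the sketch below (node numbers, the port and the addresses are taken from your logs; the cluster name shown is only the common default, so check what your files actually use):

              node:
                      ip_port = 7777
                      ip_address = 193.109.174.115
                      number = 0
                      name = vmserver15.pic.es
                      cluster = ocfs2

              node:
                      ip_port = 7777
                      ip_address = 193.109.174.110
                      number = 1
                      name = vmserver10.pic.es
                      cluster = ocfs2

              cluster:
                      node_count = 2
                      name = ocfs2

              Every pool member must carry exactly the same file, with one node: stanza per server.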
              • 4. Re: Full server pool crashes when adding new iSCSI server
                733736
                Are you running these across the same network interfaces?

                Correct, DLM and iSCSI traffic are using the same network interface. At the moment this interface is almost idle (really low traffic load). We don't have any spare interfaces on those hypervisors, so we cannot split the traffic yet, but we will soon upgrade the machines with extra network cards and then we'll be able to try it (I guess we will need to create iptables ACLs to split this traffic).
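                Thinking about it a bit more, maybe a dedicated interface would be enough instead of iptables ACLs; something like the sketch below once the new cards arrive (interface name and addresses are only my guess, standard EL5-style network scripts):

                # /etc/sysconfig/network-scripts/ifcfg-eth1 -- hypothetical dedicated OCFS2 interconnect
                DEVICE=eth1
                BOOTPROTO=static
                IPADDR=10.0.0.10
                NETMASK=255.255.255.0
                ONBOOT=yes

                The FQDNs in /etc/hosts would then point at these addresses so that o2net moves onto eth1, while the iSCSI portal stays where it is.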

                By the way, an example of our current hosts file is the following:

                [root@vmserver10 network-scripts]# cat /etc/hosts
                # Do not remove the following line, or various programs
                # that require network functionality will fail.
                127.0.0.1          localhost.localdomain localhost
                ::1          localhost6.localdomain6 localhost6
                193.109.174.110          vmserver10.pic.es vmserver10

                There is also an issue with nodes heartbeating in the wrong slot: check /etc/ocfs2/cluster.conf on all your pool members. It should be the same on all servers. If not, one of the servers is out of sync. You can use the "Restore" option in Oracle VM Manager to propagate the configuration for the pool back down to all pool members. This should rectify the issue and allow you to add the third server successfully.

                Completely true! We saw that cluster.conf was not consistent across the 4 nodes: some files contained servers which were not defined in the cluster.conf of other servers theoretically in the same server pool. I used the Restore pool option before applying changes on the node, but I am still having problems.

                Below I show the log output from the newly added node when adding the iSCSI partition.

                [root@vmserver16 ovs]# iscsiadm -m discovery -t sendtargets -p disk001
                193.109.174.106:3260,1 iqn.1986-03.com.sun:02:5deeff0d-cae2-419b-ff39-a30b4db0026d
                [root@vmserver16 ovs]# /etc/init.d/iscsi restart
                Stopping iSCSI daemon:
                iscsid dead but pid file exists                            [  OK  ]
                Turning off network shutdown. Starting iSCSI daemon:       [  OK  ]
                [  OK  ]
                Setting up iSCSI targets: Logging in to [iface: default, target: iqn.1986-03.com.sun:02:5deeff0d-cae2-419b-ff39-a30b4db0026d, portal: 193.109.174.106,3260]
                Login to [iface: default, target: iqn.1986-03.com.sun:02:5deeff0d-cae2-419b-ff39-a30b4db0026d, portal: 193.109.174.106,3260]: successful
                [  OK  ]
                [root@vmserver16 ovs]# /opt/ovs-agent-2.3/utils/repos.py --new /dev/sdb2

                /var/log/messages:
                Dec 14 10:45:03 vmserver16 modprobe: FATAL: Module ocfs2_stackglue not found.
                Dec 14 10:45:03 vmserver16 kernel: OCFS2 DLMFS 1.4.4
                Dec 14 10:45:03 vmserver16 kernel: OCFS2 User DLM kernel interface loaded
                Dec 14 10:45:05 vmserver16 kernel: (5721,0):o2hb_do_disk_heartbeat:776 ERROR: Device "sdb2": another node is heartbeating in our slot!
                Dec 14 10:45:09 vmserver16 last message repeated 2 times
                Dec 14 10:45:09 vmserver16 kernel: ocfs2_dlm: Nodes in domain ("DCF5614D641C4FBBB39E571B674DCD8D"): 0
                Dec 14 10:45:09 vmserver16 kernel: (5720,0):ocfs2_find_slot:249 slot 0 is already allocated to this node!
                Dec 14 10:45:09 vmserver16 kernel: (5720,0):ocfs2_check_volume:1934 File system was not unmounted cleanly, recovering volume.
                Dec 14 10:45:09 vmserver16 kernel: kjournald starting.  Commit interval 5 seconds
                Dec 14 10:45:09 vmserver16 kernel: ocfs2: Mounting device (8,18) on (node 0, slot 0) with ordered data mode.
                Dec 14 10:45:09 vmserver16 kernel: (5728,3):ocfs2_replay_journal:1183 Recovering node 1 from slot 1 on device (8,18)
                Dec 14 10:45:11 vmserver16 kernel: (5721,0):o2hb_do_disk_heartbeat:776 ERROR: Device "sdb2": another node is heartbeating in our slot!
                Dec 14 10:45:13 vmserver16 kernel: kjournald starting.  Commit interval 5 seconds
                Dec 14 10:45:13 vmserver16 kernel: (5721,0):o2hb_do_disk_heartbeat:776 ERROR: Device "sdb2": another node is heartbeating in our slot!
                Dec 14 10:45:17 vmserver16 last message repeated 2 times
                Dec 14 10:45:19 vmserver16 kernel: ocfs2: Unmounting device (8,18) on (node 0)

                I guess the procedure I followed is correct:

                Step 1 - Discover iSCSI partition in the new hypervisor to be added (vmserver16)
                Step 2 - repos.py --new <iSCSI_device> in the new hypervisor to be added (vmserver16)
                Step 3 - repos.py --root <UUID_iSCSI_device> in the new hypervisor to be added (vmserver16)
                Step 4 - From the manager, add new server (vmserver16) to the existing server pool.

                Is this right?

                • 5. Re: Full server pool crashes when adding new iSCSI server
                  Avi Miller-Oracle
                  Marc Caubet wrote:
                  I guess the procedure I followed is correct:
                  Is this right?
                  No -- that process is not correct. You would do that only on the Server Pool Master. For additional Oracle VM Servers, you'd only do step 1 and step 4. The rest of it is done automatically by ovs-agent as the server joins the pool. Also, your /etc/hosts file is not complete: you need entries for all the other Oracle VM Servers in your pool as well. This is to ensure that OCFS2 can map the hostname to the correct IP address as it builds the cluster.conf on startup.
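                  Based on the addresses in your logs, a complete /etc/hosts on vmserver10 would look roughly like this (and similarly on vmserver15 and vmserver16, each listing every pool member including itself):

                  # Do not remove the following line, or various programs
                  # that require network functionality will fail.
                  127.0.0.1          localhost.localdomain localhost
                  ::1                localhost6.localdomain6 localhost6
                  193.109.174.110    vmserver10.pic.es vmserver10
                  193.109.174.115    vmserver15.pic.es vmserver15
                  193.109.174.116    vmserver16.pic.es vmserver16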

                  In this situation, you now have a broken OCFS2 volume. My recommendation is to use the ./repos.py script on the Pool Master to delete that repository, which should then get propagated throughout the cluster. Once you've done that, reformat it to remove all the node information. Then, re-add using ./repos.py on the Pool Master only and restart ovs-agent on the Pool Master. Once the Pool Master has remounted the OCFS2 volume and created the /OVS symlink, use the Restore button to send the new configuration out to the remaining nodes.
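                  As a rough outline on the Pool Master only (the --list/--delete options, the device name and the ovs-agent service name below are assumptions from memory, so confirm them with ./repos.py --help and your own setup before running anything):

                  cd /opt/ovs-agent-2.3/utils
                  ./repos.py --list                # find the UUID of the broken repository (assumed option)
                  ./repos.py --delete <UUID>       # remove it from the agent configuration (assumed option)
                  mkfs.ocfs2 -L ovs /dev/sdb2      # reformat to clear stale node/slot data (label and device are illustrative)
                  ./repos.py --new /dev/sdb2       # re-add the repository
                  ./repos.py --root <new UUID>     # make it the root (/OVS) repository
                  service ovs-agent restart        # restart the agent so it remounts the volume and recreates /OVS
                  # then: Restore button in Oracle VM Manager, as described above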

                  For further analysis, can you check /var/log/ovs-agent/ovs_operation.log on each node to ensure that it mounts the OCFS2 volume correctly?
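                  Purely as a convenience sketch (assuming root ssh between the servers; hostnames are the ones from this thread), something like this would pull the relevant lines from each node:

                  for h in vmserver10 vmserver15 vmserver16; do
                      echo "== $h =="
                      ssh root@$h "grep -iE 'mount|error' /var/log/ovs-agent/ovs_operation.log | tail -n 20"
                  done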
                  • 6. Re: Full server pool crashes when adding new iSCSI server
                    733736
                    Hi Avi,

                    OK, thanks a lot for your reply. So I was really doing things the wrong way.

                    I'll redo all changes.

                    Best regards,
                    Marc
                    • 7. Re: Full server pool crashes when adding new iSCSI server
                      733736
                      I tried to configure my cluster as you said, and it is now working correctly.

                      Thanks!
                      • 8. Re: Full server pool crashes when adding new iSCSI server
                        Avi Miller-Oracle
                        Marc Caubet wrote:
                        I tried to configure my cluster as you said, and it is now working correctly.
                        Fantastic, glad to hear it. :)