I have two servers [Sun Fire X4170] clustered together using Solaris Cluster 3.3 for an Oracle database. They are connected to shared storage, a Dell EqualLogic array [iSCSI]. Lately I have run into a strange problem: both nodes come up fine and join the cluster when they boot together, but when I reboot only one of the nodes, that node does not rejoin the cluster and shows the errors listed below.
This happens on either node [if I reboot only one node at a time]. However, if I reboot both nodes at the same time, they join the cluster successfully and everything runs fine.
Below is the output from the node which I rebooted and which did not join the cluster. The other node is running fine with all the services.
In order to get out of this situation, I have to reboot both the nodes together.
# dmesg output #
Apr 23 17:37:03 srvhqon11 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe2: link down
Apr 23 17:37:12 srvhqon11 iscsi: [ID 933263 kern.notice] NOTICE: iscsi connection(5) unable to connect to target SENDTARGETS_DISCOVERY
Apr 23 17:37:12 srvhqon11 iscsi: [ID 114404 kern.notice] NOTICE: iscsi discovery failure - SendTargets (010.010.017.104)
Apr 23 17:37:13 srvhqon11 iscsi: [ID 240218 kern.notice] NOTICE: iscsi session(9) iqn.2001-05.com.equallogic:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk online
Apr 23 17:37:13 srvhqon11 scsi: [ID 583861 kern.info] sd11 at scsi_vhci0: unit-address g6090a0887073cf961b0ae505000030ef: g6090a0887073cf961b0ae505000030ef
Apr 23 17:37:13 srvhqon11 genunix: [ID 936769 kern.info] sd11 is /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef
Apr 23 17:37:13 srvhqon11 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Apr 23 17:37:13 srvhqon11 /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef (sd11): Command failed to complete (3) on path email@example.com:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk0001,0
Apr 23 17:46:54 srvhqon11 svc.startd: [ID 122153 daemon.warning] svc:/network/iscsi/initiator:default: Method or service exit timed out. Killing contract 41.
Apr 23 17:46:54 srvhqon11 svc.startd: [ID 636263 daemon.warning] svc:/network/iscsi/initiator:default: Method "/lib/svc/method/iscsid start" failed due to signal KILL.
Apr 23 17:46:54 srvhqon11 svc.startd: [ID 748625 daemon.error] network/iscsi/initiator:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)
Apr 24 14:50:16 srvhqon11 svc.startd: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 1
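To separate the actual failures from the informational lines in the log above, a small filter along these lines can help. This is only a sketch: the regular expression, the hint list, and the name find_problems are my own assumptions and not part of any Solaris or cluster tool. The sample lines are copied from the dmesg output above.

```python
import re

# Matches the syslog prefix used in the dmesg output above, e.g.
# "Apr 23 17:37:12 srvhqon11 iscsi: [ID 933263 kern.notice] message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<tag>[\w./-]+):\s+\[ID\s+\d+\s+(?P<prio>[\w.]+)\]\s+(?P<msg>.*)$"
)

# Substrings treated as signs of trouble (illustrative, not exhaustive).
PROBLEM_HINTS = (
    "unable to connect",
    "discovery failure",
    "link down",
    "timed out",
    "failed",
    "maintenance",
)


def find_problems(lines):
    """Return (timestamp, tag, message) tuples for lines that look like failures."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            # Continuation lines without an "[ID ...]" field are skipped.
            continue
        msg = m.group("msg")
        if any(hint in msg.lower() for hint in PROBLEM_HINTS):
            hits.append((m.group("ts"), m.group("tag"), msg))
    return hits


# Sample lines taken from the log above.
SAMPLE = [
    "Apr 23 17:37:03 srvhqon11 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe2: link down",
    "Apr 23 17:37:12 srvhqon11 iscsi: [ID 933263 kern.notice] NOTICE: iscsi connection(5) unable to connect to target SENDTARGETS_DISCOVERY",
    "Apr 23 17:37:13 srvhqon11 iscsi: [ID 240218 kern.notice] NOTICE: iscsi session(9) iqn.2001-05.com.equallogic:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk online",
    "Apr 23 17:46:54 srvhqon11 svc.startd: [ID 748625 daemon.error] network/iscsi/initiator:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)",
]

if __name__ == "__main__":
    for ts, tag, msg in find_problems(SAMPLE):
        print(ts, tag, msg, sep=" | ")
```

With the sample lines, this keeps the link-down, discovery-connect, and SMF maintenance messages and drops the "session online" line, which makes the order of events easier to follow.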
# svcs -xv output #
svc:/application/print/server:default (LP print server)
State: disabled since Tue Apr 23 17:36:44 2013
Reason: Disabled by an administrator.
See: man -M /usr/share/man -s 1M lpsched
Impact: 2 dependent services are not running:
State: maintenance since Tue Apr 23 17:46:54 2013
Reason: Restarting too quickly.
Impact: This service is not running.
######## Cluster Status from working node ############
root@srvhqon10 # cluster status
=== Cluster Nodes ===
--- Node Status ---
Node Name Status