1 Reply — Latest reply: Apr 28, 2013 11:46 PM by 1006025

Node does not join cluster upon reboot

1005115 Newbie
Hi Guys,

I have two servers [Sun Fire X4170] clustered together using Solaris Cluster 3.3 for Oracle Database. They are connected to shared storage, a Dell EqualLogic iSCSI array. Lately I have run into a weird problem: both nodes come up fine and join the cluster when rebooted together; however, when I reboot only one of the nodes, it does not rejoin the cluster and shows the following errors:

This happens on both nodes [if I reboot only one node at a time]. But if I reboot both nodes at the same time, they successfully join the cluster and everything runs fine.

Below is the output from the node I rebooted, which did not join the cluster and produced the following errors. The other node is running fine with all the services.
To get out of this situation, I have to reboot both nodes together.

# dmesg output #
Apr 23 17:37:03 srvhqon11 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe2: link down
Apr 23 17:37:12 srvhqon11 iscsi: [ID 933263 kern.notice] NOTICE: iscsi connection(5) unable to connect to target SENDTARGETS_DISCOVERY
Apr 23 17:37:12 srvhqon11 iscsi: [ID 114404 kern.notice] NOTICE: iscsi discovery failure - SendTargets (010.010.017.104)
Apr 23 17:37:13 srvhqon11 iscsi: [ID 240218 kern.notice] NOTICE: iscsi session(9) iqn.2001-05.com.equallogic:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk online
Apr 23 17:37:13 srvhqon11 scsi: [ID 583861 kern.info] sd11 at scsi_vhci0: unit-address g6090a0887073cf961b0ae505000030ef: g6090a0887073cf961b0ae505000030ef
Apr 23 17:37:13 srvhqon11 genunix: [ID 936769 kern.info] sd11 is /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef
Apr 23 17:37:13 srvhqon11 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Apr 23 17:37:13 srvhqon11 /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef (sd11): Command failed to complete (3) on path iscsi0/disk@0000iqn.2001-05.com.equallogic:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk0001,0
Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 122153 daemon.warning] svc:/network/iscsi/initiator:default: Method or service exit timed out. Killing contract 41.
Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 636263 daemon.warning] svc:/network/iscsi/initiator:default: Method "/lib/svc/method/iscsid start" failed due to signal KILL.
Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 748625 daemon.error] network/iscsi/initiator:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)
Apr 24 14:50:16 srvhqon11 svc.startd[11]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 1

root@srvhqon11 # svcs -xv
svc:/system/cluster/loaddid:default (Oracle Solaris Cluster loaddid)
State: offline since Tue Apr 23 17:46:54 2013
Reason: Start method is running.
See: http://sun.com/msg/SMF-8000-C4
See: /var/svc/log/system-cluster-loaddid:default.log
Impact: 49 dependent services are not running:
svc:/system/cluster/bootcluster:default
svc:/system/cluster/cl_execd:default
svc:/system/cluster/zc_cmd_log_replay:default
svc:/system/cluster/sc_zc_member:default
svc:/system/cluster/sc_rtreg_server:default
svc:/system/cluster/sc_ifconfig_server:default
svc:/system/cluster/initdid:default
svc:/system/cluster/globaldevices:default
svc:/system/cluster/gdevsync:default
svc:/milestone/multi-user:default
svc:/system/boot-config:default
svc:/system/cluster/cl-svc-enable:default
svc:/milestone/multi-user-server:default
svc:/application/autoreg:default
svc:/system/basicreg:default
svc:/system/zones:default
svc:/system/cluster/sc_zones:default
svc:/system/cluster/scprivipd:default
svc:/system/cluster/cl-svc-cluster-milestone:default
svc:/system/cluster/sc_svtag:default
svc:/system/cluster/sckeysync:default
svc:/system/cluster/rpc-fed:default
svc:/system/cluster/rgm-starter:default
svc:/application/management/common-agent-container-1:default
svc:/system/cluster/scsymon-srv:default
svc:/system/cluster/sc_syncsa_server:default
svc:/system/cluster/scslmclean:default
svc:/system/cluster/cznetd:default
svc:/system/cluster/scdpm:default
svc:/system/cluster/rpc-pmf:default
svc:/system/cluster/pnm:default
svc:/system/cluster/sc_pnm_proxy_server:default
svc:/system/cluster/cl-event:default
svc:/system/cluster/cl-eventlog:default
svc:/system/cluster/cl-ccra:default
svc:/system/cluster/ql_upgrade:default
svc:/system/cluster/mountgfs:default
svc:/system/cluster/clusterdata:default
svc:/system/cluster/ql_rgm:default
svc:/system/cluster/scqdm:default
svc:/application/stosreg:default
svc:/application/sthwreg:default
svc:/application/graphical-login/cde-login:default
svc:/application/cde-printinfo:default
svc:/system/cluster/scvxinstall:default
svc:/system/cluster/sc_failfast:default
svc:/system/cluster/clexecd:default
svc:/system/cluster/sc_pmmd:default
svc:/system/cluster/clevent_listenerd:default

svc:/application/print/server:default (LP print server)
State: disabled since Tue Apr 23 17:36:44 2013
Reason: Disabled by an administrator.
See: http://sun.com/msg/SMF-8000-05
See: man -M /usr/share/man -s 1M lpsched
Impact: 2 dependent services are not running:
svc:/application/print/rfc1179:default
svc:/application/print/ipp-listener:default

svc:/network/iscsi/initiator:default (?)
State: maintenance since Tue Apr 23 17:46:54 2013
Reason: Restarting too quickly.
See: http://sun.com/msg/SMF-8000-L5
See: /var/svc/log/network-iscsi-initiator:default.log
Impact: This service is not running.

######## Cluster Status from working node ############

root@srvhqon10 # cluster status

=== Cluster Nodes ===

--- Node Status ---

Node Name Status
--------- ------
srvhqon10 Online
srvhqon11 Offline


=== Cluster Transport Paths ===

Endpoint1 Endpoint2 Status
--------- --------- ------
srvhqon10:igb3 srvhqon11:igb3 faulted
srvhqon10:igb2 srvhqon11:igb2 faulted


=== Cluster Quorum ===

--- Quorum Votes Summary from (latest node reconfiguration) ---

Needed Present Possible
------ ------- --------
2 2 3


--- Quorum Votes by Node (current status) ---

Node Name Present Possible Status
--------- ------- -------- ------
srvhqon10 1 1 Online
srvhqon11 0 1 Offline


--- Quorum Votes by Device (current status) ---

Device Name Present Possible Status
----------- ------- -------- ------
d2 1 1 Online


=== Cluster Device Groups ===

--- Device Group Status ---

Device Group Name Primary Secondary Status
----------------- ------- --------- ------


--- Spare, Inactive, and In Transition Nodes ---

Device Group Name Spare Nodes Inactive Nodes In Transistion Nodes
----------------- ----------- -------------- --------------------


--- Multi-owner Device Group Status ---

Device Group Name Node Name Status
----------------- --------- ------

=== Cluster Resource Groups ===

Group Name Node Name Suspended State
---------- --------- --------- -----
ora-rg srvhqon10 No Online
srvhqon11 No Offline

nfs-rg srvhqon10 No Online
srvhqon11 No Offline

backup-rg srvhqon10 No Online
srvhqon11 No Offline


=== Cluster Resources ===

Resource Name Node Name State Status Message
------------- --------- ----- --------------
ora-listener srvhqon10 Online Online
srvhqon11 Offline Offline

ora-server srvhqon10 Online Online
srvhqon11 Offline Offline

ora-stor srvhqon10 Online Online
srvhqon11 Offline Offline

ora-lh srvhqon10 Online Online - LogicalHostname online.
srvhqon11 Offline Offline

nfs-rs srvhqon10 Online Online - Service is online.
srvhqon11 Offline Offline

nfs-stor-rs srvhqon10 Online Online
srvhqon11 Offline Offline

nfs-lh-rs srvhqon10 Online Online - LogicalHostname online.
srvhqon11 Offline Offline

backup-stor srvhqon10 Online Online
srvhqon11 Offline Offline

cluster: (C383355) No response from daemon on node "srvhqon11".

=== Cluster DID Devices ===

Device Instance Node Status
--------------- ---- ------
/dev/did/rdsk/d1 srvhqon10 Ok

/dev/did/rdsk/d2 srvhqon10 Ok
srvhqon11 Unknown

/dev/did/rdsk/d3 srvhqon10 Ok
srvhqon11 Unknown

/dev/did/rdsk/d4 srvhqon10 Ok

/dev/did/rdsk/d5 srvhqon10 Fail
srvhqon11 Unknown

/dev/did/rdsk/d6 srvhqon11 Unknown

/dev/did/rdsk/d7 srvhqon11 Unknown

/dev/did/rdsk/d8 srvhqon10 Ok
srvhqon11 Unknown

/dev/did/rdsk/d9 srvhqon10 Ok
srvhqon11 Unknown


=== Zone Clusters ===

--- Zone Cluster Status ---

Name Node Name Zone HostName Status Zone Status
---- --------- ------------- ------ -----------

Regards.
  • 1. Re: Node does not join cluster upon reboot
    1006025 Newbie
    Check if your global devices are mounted properly:

    # cat /etc/mnttab | grep -i global

    Check that the proper entries are there on both systems:

    # cat /etc/vfstab | grep -i global

    Give the output for the quorum devices:

    # scstat -q
    or
    # clquorum list -v

    Also check why your iSCSI initiator service is going offline unexpectedly:

    # vi /var/svc/log/network-iscsi-initiator:default.log
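    The checks above can be gathered into a single diagnostic pass on the rebooted node. This is only a sketch, assuming Solaris Cluster 3.3 and root access; the service FMRI is the one shown in the `svcs -xv` output earlier in the thread, and the `svcadm clear` step is an assumption that only applies if the underlying cause (e.g. the iSCSI array being unreachable at boot) has already been resolved:

    ```shell
    #!/bin/sh
    # Diagnostic sketch for a node that fails to rejoin the cluster.
    # Run as root on the affected node (Solaris Cluster 3.3 assumed).

    # 1. Is the global devices namespace actually mounted?
    grep -i global /etc/mnttab

    # 2. Do the vfstab entries for global devices match on both nodes?
    grep -i global /etc/vfstab

    # 3. Quorum configuration and votes as seen by this node.
    clquorum list -v

    # 4. Why is the iSCSI initiator service in maintenance?
    svcs -xv svc:/network/iscsi/initiator:default
    tail -50 /var/svc/log/network-iscsi-initiator:default.log

    # 5. If the cause was transient (array unreachable at boot),
    #    clear the maintenance state and let SMF retry the start method:
    svcadm clear svc:/network/iscsi/initiator:default
    ```

    Note that the cluster services (`loaddid` and its 49 dependents) wait on the iSCSI initiator, so until that service leaves maintenance the node cannot see the shared DID devices and will not rejoin.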
