
    Node does not join cluster upon reboot

    user9071317
      Hi Guys,

      I have two servers [Sun Fire X4170] clustered together using Solaris Cluster 3.3 for an Oracle Database. They are connected to shared storage, a Dell EqualLogic [iSCSI] array. Lately I have run into a strange problem: both nodes come up fine and join the cluster when rebooted together; however, when I reboot only one of the nodes, it does not rejoin the cluster and shows the errors below.

      This happens to either node [if I reboot only one node at a time]. But if I reboot both nodes at the same time, they join the cluster successfully and everything runs fine.

      Below is the output from the node I rebooted; it did not join the cluster and threw the following errors. The other node is running fine with all the services.
      To get out of this situation, I have to reboot both nodes together.

      # dmesg output #
      Apr 23 17:37:03 srvhqon11 ixgbe: [ID 611667 kern.info] NOTICE: ixgbe2: link down
      Apr 23 17:37:12 srvhqon11 iscsi: [ID 933263 kern.notice] NOTICE: iscsi connection(5) unable to connect to target SENDTARGETS_DISCOVERY
      Apr 23 17:37:12 srvhqon11 iscsi: [ID 114404 kern.notice] NOTICE: iscsi discovery failure - SendTargets (010.010.017.104)
      Apr 23 17:37:13 srvhqon11 iscsi: [ID 240218 kern.notice] NOTICE: iscsi session(9) iqn.2001-05.com.equallogic:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk online
      Apr 23 17:37:13 srvhqon11 scsi: [ID 583861 kern.info] sd11 at scsi_vhci0: unit-address g6090a0887073cf961b0ae505000030ef: g6090a0887073cf961b0ae505000030ef
      Apr 23 17:37:13 srvhqon11 genunix: [ID 936769 kern.info] sd11 is /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef
      Apr 23 17:37:13 srvhqon11 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
      Apr 23 17:37:13 srvhqon11 /scsi_vhci/disk@g6090a0887073cf961b0ae505000030ef (sd11): Command failed to complete (3) on path iscsi0/disk@0000iqn.2001-05.com.equallogic:0-8a0906-96cf73708-ef30000005e50a1b-sblprdbk0001,0
      Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 122153 daemon.warning] svc:/network/iscsi/initiator:default: Method or service exit timed out. Killing contract 41.
      Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 636263 daemon.warning] svc:/network/iscsi/initiator:default: Method "/lib/svc/method/iscsid start" failed due to signal KILL.
      Apr 23 17:46:54 srvhqon11 svc.startd[11]: [ID 748625 daemon.error] network/iscsi/initiator:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)
      Apr 24 14:50:16 srvhqon11 svc.startd[11]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 1

      root@srvhqon11 # svcs -xv
      svc:/system/cluster/loaddid:default (Oracle Solaris Cluster loaddid)
      State: offline since Tue Apr 23 17:46:54 2013
      Reason: Start method is running.
      See: http://sun.com/msg/SMF-8000-C4
      See: /var/svc/log/system-cluster-loaddid:default.log
      Impact: 49 dependent services are not running:
      svc:/system/cluster/bootcluster:default
      svc:/system/cluster/cl_execd:default
      svc:/system/cluster/zc_cmd_log_replay:default
      svc:/system/cluster/sc_zc_member:default
      svc:/system/cluster/sc_rtreg_server:default
      svc:/system/cluster/sc_ifconfig_server:default
      svc:/system/cluster/initdid:default
      svc:/system/cluster/globaldevices:default
      svc:/system/cluster/gdevsync:default
      svc:/milestone/multi-user:default
      svc:/system/boot-config:default
      svc:/system/cluster/cl-svc-enable:default
      svc:/milestone/multi-user-server:default
      svc:/application/autoreg:default
      svc:/system/basicreg:default
      svc:/system/zones:default
      svc:/system/cluster/sc_zones:default
      svc:/system/cluster/scprivipd:default
      svc:/system/cluster/cl-svc-cluster-milestone:default
      svc:/system/cluster/sc_svtag:default
      svc:/system/cluster/sckeysync:default
      svc:/system/cluster/rpc-fed:default
      svc:/system/cluster/rgm-starter:default
      svc:/application/management/common-agent-container-1:default
      svc:/system/cluster/scsymon-srv:default
      svc:/system/cluster/sc_syncsa_server:default
      svc:/system/cluster/scslmclean:default
      svc:/system/cluster/cznetd:default
      svc:/system/cluster/scdpm:default
      svc:/system/cluster/rpc-pmf:default
      svc:/system/cluster/pnm:default
      svc:/system/cluster/sc_pnm_proxy_server:default
      svc:/system/cluster/cl-event:default
      svc:/system/cluster/cl-eventlog:default
      svc:/system/cluster/cl-ccra:default
      svc:/system/cluster/ql_upgrade:default
      svc:/system/cluster/mountgfs:default
      svc:/system/cluster/clusterdata:default
      svc:/system/cluster/ql_rgm:default
      svc:/system/cluster/scqdm:default
      svc:/application/stosreg:default
      svc:/application/sthwreg:default
      svc:/application/graphical-login/cde-login:default
      svc:/application/cde-printinfo:default
      svc:/system/cluster/scvxinstall:default
      svc:/system/cluster/sc_failfast:default
      svc:/system/cluster/clexecd:default
      svc:/system/cluster/sc_pmmd:default
      svc:/system/cluster/clevent_listenerd:default

      svc:/application/print/server:default (LP print server)
      State: disabled since Tue Apr 23 17:36:44 2013
      Reason: Disabled by an administrator.
      See: http://sun.com/msg/SMF-8000-05
      See: man -M /usr/share/man -s 1M lpsched
      Impact: 2 dependent services are not running:
      svc:/application/print/rfc1179:default
      svc:/application/print/ipp-listener:default

      svc:/network/iscsi/initiator:default (?)
      State: maintenance since Tue Apr 23 17:46:54 2013
      Reason: Restarting too quickly.
      See: http://sun.com/msg/SMF-8000-L5
      See: /var/svc/log/network-iscsi-initiator:default.log
      Impact: This service is not running.

      ######## Cluster Status from working node ############

      root@srvhqon10 # cluster status

      === Cluster Nodes ===

      --- Node Status ---

      Node Name Status
      --------- ------
      srvhqon10 Online
      srvhqon11 Offline


      === Cluster Transport Paths ===

      Endpoint1 Endpoint2 Status
      --------- --------- ------
      srvhqon10:igb3 srvhqon11:igb3 faulted
      srvhqon10:igb2 srvhqon11:igb2 faulted


      === Cluster Quorum ===

      --- Quorum Votes Summary from (latest node reconfiguration) ---

      Needed Present Possible
      ------ ------- --------
      2 2 3


      --- Quorum Votes by Node (current status) ---

      Node Name Present Possible Status
      --------- ------- -------- ------
      srvhqon10 1 1 Online
      srvhqon11 0 1 Offline


      --- Quorum Votes by Device (current status) ---

      Device Name Present Possible Status
      ----------- ------- -------- ------
      d2 1 1 Online


      === Cluster Device Groups ===

      --- Device Group Status ---

      Device Group Name Primary Secondary Status
      ----------------- ------- --------- ------


      --- Spare, Inactive, and In Transition Nodes ---

      Device Group Name Spare Nodes Inactive Nodes In Transition Nodes
      ----------------- ----------- -------------- --------------------


      --- Multi-owner Device Group Status ---

      Device Group Name Node Name Status
      ----------------- --------- ------

      === Cluster Resource Groups ===

      Group Name Node Name Suspended State
      ---------- --------- --------- -----
      ora-rg srvhqon10 No Online
      srvhqon11 No Offline

      nfs-rg srvhqon10 No Online
      srvhqon11 No Offline

      backup-rg srvhqon10 No Online
      srvhqon11 No Offline


      === Cluster Resources ===

      Resource Name Node Name State Status Message
      ------------- --------- ----- --------------
      ora-listener srvhqon10 Online Online
      srvhqon11 Offline Offline

      ora-server srvhqon10 Online Online
      srvhqon11 Offline Offline

      ora-stor srvhqon10 Online Online
      srvhqon11 Offline Offline

      ora-lh srvhqon10 Online Online - LogicalHostname online.
      srvhqon11 Offline Offline

      nfs-rs srvhqon10 Online Online - Service is online.
      srvhqon11 Offline Offline

      nfs-stor-rs srvhqon10 Online Online
      srvhqon11 Offline Offline

      nfs-lh-rs srvhqon10 Online Online - LogicalHostname online.
      srvhqon11 Offline Offline

      backup-stor srvhqon10 Online Online
      srvhqon11 Offline Offline

      cluster: (C383355) No response from daemon on node "srvhqon11".

      === Cluster DID Devices ===

      Device Instance Node Status
      --------------- ---- ------
      /dev/did/rdsk/d1 srvhqon10 Ok

      /dev/did/rdsk/d2 srvhqon10 Ok
      srvhqon11 Unknown

      /dev/did/rdsk/d3 srvhqon10 Ok
      srvhqon11 Unknown

      /dev/did/rdsk/d4 srvhqon10 Ok

      /dev/did/rdsk/d5 srvhqon10 Fail
      srvhqon11 Unknown

      /dev/did/rdsk/d6 srvhqon11 Unknown

      /dev/did/rdsk/d7 srvhqon11 Unknown

      /dev/did/rdsk/d8 srvhqon10 Ok
      srvhqon11 Unknown

      /dev/did/rdsk/d9 srvhqon10 Ok
      srvhqon11 Unknown


      === Zone Clusters ===

      --- Zone Cluster Status ---

      Name Node Name Zone HostName Status Zone Status
      ---- --------- ------------- ------ -----------

      Regards.
        • 1. Re: Node does not join cluster upon reboot
          1006025
          Check whether your global devices are mounted properly:

          #cat /etc/mnttab | grep -i global
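
          On a healthy node you should see the node's global-devices file system mounted, something like the line below (the DID device and node number are only placeholders, yours will differ):

          /dev/did/dsk/d3s3   /global/.devices/node@1   ufs   rw,...,global   <mount time>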

          Check that the proper entries are present in /etc/vfstab on both systems:

          #cat /etc/vfstab | grep -i global
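
          For reference, the matching vfstab entry normally looks roughly like this on each node (again, d3s3 and node@1 are placeholders; each node mounts its own node@N file system):

          /dev/did/dsk/d3s3   /dev/did/rdsk/d3s3   /global/.devices/node@1   ufs   2   no   global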

          Also post the output for the quorum devices:

          #scstat -q
          or
          #clquorum list -v
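
          Since d2 is your quorum device and the DID devices show Unknown for srvhqon11, it is also worth checking whether the rebooted node can still see the quorum LUN at all, e.g. (assuming the quorum device really is d2, as in your output):

          #clquorum status
          #cldevice status d2
          #scdidadm -L | grep d2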


          Also check why your iSCSI initiator service is going into maintenance unexpectedly:

          #vi /var/svc/log/network-iscsi-initiator:default.log
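
          From your dmesg output, the initiator's start method is timing out on SendTargets discovery against 010.010.017.104, and SMF then puts the service into maintenance, which blocks loaddid and all the cluster services that depend on it. After reviewing the log, something along these lines (a rough sequence, not a guaranteed fix) should tell you whether discovery is working and let you bring the service back:

          #ping 10.10.17.104
          #iscsiadm list discovery
          #iscsiadm list discovery-address
          #svcadm clear svc:/network/iscsi/initiator:default
          #svcs -l svc:/network/iscsi/initiator:default
          #iscsiadm list target -S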