10 Replies Latest reply: Apr 5, 2012 10:50 AM by 924908

    SC 3.3: How to reinstall second node

    924908
      Hi,

      there are two nodes: nodeA and nodeB.

      The cluster is working fine; there are two metasets for the failover filesystems.

      Unfortunately nodeB has suffered a hardware failure, and we have to replace it
      completely with new hardware.

      My question is: what are the steps to join nodeB (after the reinstall) to the existing cluster,
      given that nodeB is still part of the cluster configuration?

      Do I have to unconfigure nodeB from the cluster configuration before I reinstall it?
      If yes, what are the steps?

      What should I do with the metasets before and after nodeB is reinstalled?

      Any help would be appreciated.

      Heinz

        • 1. Re: SC 3.3: How to reinstall second node
          807928
          Yes, I think you'd have to remove nodeB from the installation, reinstall it, and rejoin, unless of course you have a full backup.

          If you have a full backup, you might be able to do a restore and reboot. For the metaset, you'd need to remove nodeB from the metaset first and then re-add it. That can only be done (IIRC) with nodeB down.

          However, I'm pretty sure that this is all documented in the manuals, so please check the correct procedure there.

          http://docs.oracle.com/cd/E18728_01/html/821-2847/cacjggea.html#scrolltoc
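
          Just to give a rough sketch of the shape of it (set and host names taken from your post; this is only an outline, not the authoritative procedure, so please follow the documented steps exactly, including removing nodeB from any resource-group node lists, device groups and quorum configuration first). With nodeB dead, run on nodeA:

          # metaset -s kgr -d -m nodeB
          # metaset -s kgr -d -f -h nodeB
          # metaset -s sybase -d -m nodeB
          # metaset -s sybase -d -f -h nodeB
          # clnode clear -F nodeB

          (-f is needed because nodeB cannot be reached.) After reinstalling Solaris and the cluster framework on the new hardware, run scinstall on nodeB and choose the option to add the machine as a node in the existing cluster. Once it has rejoined, re-add it to the metasets from nodeA:

          # metaset -s kgr -a -h nodeB
          # metaset -s kgr -a -m nodeA nodeB
          # metaset -s sybase -a -h nodeB
          # metaset -s sybase -a -m nodeA nodeB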

          Tim
          ---
          • 2. Re: SC 3.3: How to reinstall second node
            924908
            Thanks for the quick response and the link.

            If we unconfigure nodeB from the cluster configuration as described in the link,
            is the cluster reset to install mode, and after reinstalling and rejoining nodeB
            do I have to clear install mode again?

            Heinz
            • 3. Re: SC 3.3: How to reinstall second node
              807928
              Not as far as I know. That's because the cluster has already been installed and has a quorum device (or quorum server) allocated, which will keep the one-node cluster stable while you replace the broken node.
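
              If you want to double-check, something like the following on the surviving node should confirm it (exact output format may vary by release):

              # cluster show -t global | grep -i installmode
              # clquorum status

              installmode should show as disabled and the quorum device should still be online.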

              Regards,

              Tim
              ---
              • 4. Re: SC 3.3: How to reinstall second node
                924908
                Tim,

                after reinstalling nodeB we are able to bring our resources online and switch them between the two nodes! Thanks for your help.

                But now we've got another problem:

                When we "init 0" one of the two nodes, the resourcegroups are brought online on the remaining node but we got this error:

                "Device service sybase associated with path sybase is Degraded."

                [root]/var/adm {NSH}: cldevicegroup status

                === Cluster Device Groups ===

                --- Device Group Status ---

                Device Group Name     Primary     Secondary     Status
                kgr                   nodeB       -             Degraded
                sybase                nodeB       -             Degraded

                Here is the output of metaset:
                [root]/var/adm {NSH}: metaset

                Set name = kgr, Set number = 1

                Host                  Owner
                  nodeA
                  nodeB               Yes

                Mediator Host(s)      Aliases
                  nodeA
                  nodeB

                Drive   Dbase
                d3      Yes
                d4      Yes
                d10     Yes
                d11     Yes

                Set name = sybase, Set number = 2

                Host                  Owner
                  nodeA
                  nodeB               Yes

                Mediator Host(s)      Aliases
                  nodeA
                  nodeB

                Drive   Dbase
                d5      Yes
                d6      Yes
                d7      Yes
                d8      Yes
                d9      Yes
                d12     Yes
                d13     Yes
                d14     Yes
                d15     Yes
                d16     Yes

                Everything seems OK with the devices. When I manually mount the filesystems, everything is fine.

                Any hints?

                Regards,
                Heinz

                • 5. Re: SC 3.3: How to reinstall second node
                  807928
                  Is the output from metaset the same on both nodes? What about the output from:

                  # metadb -s sybase -i
                  # metadb -s kgr -i

                  Have you tried running:
                  # cldev refresh
                  # cldev populate

                  on both nodes? Just some suggestions.

                  Tim
                  ---
                  • 6. Re: SC 3.3: How to reinstall second node
                    924908
                    Seems to be OK:

                    krn630[root]~ {NSH}: metadb -s sybase -i
                    Proxy command to: krn730
                            flags          first blk     block count
                         a m  luo             16            8192        /dev/did/dsk/d5s7
                         a    luo             16            8192        /dev/did/dsk/d6s7
                         a    luo             16            8192        /dev/did/dsk/d7s7
                         a    luo             16            8192        /dev/did/dsk/d8s7
                         a    luo             16            8192        /dev/did/dsk/d9s7
                         a    luo             16            8192        /dev/did/dsk/d12s7
                         a    luo             16            8192        /dev/did/dsk/d13s7
                         a    luo             16            8192        /dev/did/dsk/d14s7
                         a    luo             16            8192        /dev/did/dsk/d15s7
                         a    luo             16            8192        /dev/did/dsk/d16s7

                    krn730[root]/var/adm {NSH}: metadb -s sybase -i
                            flags          first blk     block count
                         a m  luo             16            8192        /dev/did/dsk/d5s7
                         a    luo             16            8192        /dev/did/dsk/d6s7
                         a    luo             16            8192        /dev/did/dsk/d7s7
                         a    luo             16            8192        /dev/did/dsk/d8s7
                         a    luo             16            8192        /dev/did/dsk/d9s7
                         a    luo             16            8192        /dev/did/dsk/d12s7
                         a    luo             16            8192        /dev/did/dsk/d13s7
                         a    luo             16            8192        /dev/did/dsk/d14s7
                         a    luo             16            8192        /dev/did/dsk/d15s7
                         a    luo             16            8192        /dev/did/dsk/d16s7


                    When I "init 0" the other node (krn630 =nodeA ) the following messages appear on the remaining nodeB (krn730 ) in /var/adm/messages . The resource for the metaset sybase is sybase-dg:

                    Mar 26 17:16:42 krn730 cl_runtime: [ID 273354 kern.notice] NOTICE: CMM: Node krn630 (nodeid = 1) is dead.
                    Mar 26 17:16:47 krn730 cl_runtime: [ID 489438 kern.notice] NOTICE: clcomm: Path krn730:nxge1 - krn630:nxge1 being drained
                    Mar 26 17:16:47 krn730 cl_runtime: [ID 489438 kern.notice] NOTICE: clcomm: Path krn730:nxge3 - krn630:nxge3 being drained
                    Mar 26 17:16:47 krn730 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 000.000.000.000:0, remote = 172.016.004.001:0, start = -2, end = 6
                    Mar 26 17:16:47 krn730 ip: [ID 302654 kern.notice] TCP_IOC_ABORT_CONN: aborted 0 connection
                    Mar 26 17:16:48 krn730 cl_runtime: [ID 446068 kern.notice] NOTICE: CMM: Node krn630 (nodeid = 1) is down.
                    Mar 26 17:16:48 krn730 cl_runtime: [ID 108990 kern.notice] NOTICE: CMM: Cluster members: krn730.
                    Mar 26 17:16:48 krn730 cl_runtime: [ID 279084 kern.notice] NOTICE: CMM: node reconfiguration #4 completed.
                    Mar 26 17:16:48 krn730 cl_runtime: [ID 250885 kern.notice] NOTICE: CMM: Quorum device /dev/did/rdsk/d4s2: owner set to node 2.
                    Mar 26 17:16:48 krn730 Cluster.Framework: [ID 801593 daemon.notice] stdout: fencing node krn630 from shared devices
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d3s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d5s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d6s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d7s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d8s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d9s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d10s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d11s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d12s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d13s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d14s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d15s2
                    Mar 26 17:16:48 krn730 Cluster.CCR: [ID 651093 daemon.warning] reservation message(fence_node) - Fencing node 1 from disk /dev/did/rdsk/d16s2
                    Mar 26 17:16:55 krn730 Cluster.RGM.global.rgmd: [ID 676558 daemon.warning] WARNING: Global_resources_used property of resource group <localstuff-rg> is set to non-null string, assuming wildcard
                    Mar 26 17:16:55 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource IP_krn733 status msg on node krn630 change to <>
                    Mar 26 17:16:55 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource IP_krn633 status msg on node krn630 change to <>
                    Mar 26 17:17:32 krn730 SC[,SUNW.HAStoragePlus:9,sybase-rg,sybase-dg,hastorageplus_probe]: [ID 419839 daemon.error] Device service sybase associated with path sybase is Degraded.
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_DEGRADED
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <Device service sybase associated with path sybase is Degraded.>
                    Mar 26 17:17:32 krn730 SC[,SUNW.HAStoragePlus:9,sybase-rg,sybase-dg,hastorageplus_probe]: [ID 100000 daemon.error]
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_FAULTED
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <>
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_DEGRADED
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <Service is degraded.>
                    Mar 26 17:17:32 krn730 SC[,SUNW.HAStoragePlus:9,sybase-rg,sybase-dg,hastorageplus_probe]: [ID 831072 daemon.notice] Issuing a resource restart request because of probe failures.
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 494478 daemon.notice] resource sybase-dg in resource group sybase-rg has requested restart of the resource on krn730.
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group sybase-rg state on node krn730 change to RG_ON_PENDING_R_RESTART
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_monitor_stop> for resource <sybase-dg>, resource group <sybase-rg>, node <krn730>, timeout <90> seconds
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hastorageplus_monitor_stop> completed successfully for resource <sybase-dg>, resource group <sybase-rg>, node <krn730>, time used: 0% of timeout <90 seconds>
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource sybase-dg state on node krn730 change to R_ONLINE_UNMON
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_UNKNOWN
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <Stopping>
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_postnet_stop> for resource <sybase-dg>, resource group <sybase-rg>, node <krn730>, timeout <1800> seconds
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hastorageplus_postnet_stop> completed successfully for resource <sybase-dg>, resource group <sybase-rg>, node <krn730>, time used: 0% of timeout <1800 seconds>
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource sybase-dg state on node krn730 change to R_OFFLINE
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_OFFLINE
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <>
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_prenet_start> for resource <sybase-dg>, resource group <sybase-rg>, node <krn730>, timeout <1800> seconds
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_UNKNOWN
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <Starting>
                    Mar 26 17:17:32 krn730 SC[,SUNW.HAStoragePlus:9,sybase-rg,sybase-dg,hastorageplus_prenet_start]: [ID 419839 daemon.error] Device service sybase associated with path sybase is Degraded.
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_DEGRADED
                    Mar 26 17:17:32 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <Device service sybase associated with path sybase is Degraded.>
                    Mar 26 17:17:33 krn730 SC[,SUNW.HAStoragePlus:9,sybase-rg,sybase-dg,hastorageplus_prenet_start]: [ID 100000 daemon.error]
                    Mar 26 17:17:33 krn730 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource sybase-dg status on node krn730 change to R_FM_FAULTED
                    Mar 26 17:17:33 krn730 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource sybase-dg status msg on node krn730 change to <>
                    Mar 26 17:17:33 krn730 Cluster.RGM.global.rgmd: [ID 938318 daemon.error] Method <hastorageplus_prenet_start> failed on resource <sybase-dg> in resource group <sybase-rg> [exit code <1>, time used: 0% of timeout <1800 seconds>]
                    Mar 26 17:17:33 krn730 Cluster.RGM.global.rgmd: [ID 443746 daemon.error] resource sybase-dg state on node krn730 change to R_START_FAILED


                    Regards,
                    Heinz
                    • 7. Re: SC 3.3: How to reinstall second node
                      807928
                      Did you try the other commands too?

                      Tim
                      ---
                      • 8. Re: SC 3.3: How to reinstall second node
                        924908
                        Yes, I did, but there was no output.

                        The problem is that I don't know whether the issue lies within the cluster or the metasets.
                        But when I manually mount my filesystems, everything is OK.

                        In the worst case I would have to remove all cluster resources/groups/interconnects, remove all cluster nodes (even the last one),
                        remove all Sun Cluster packages, and reinstall the cluster framework from scratch. I haven't done that before!

                        When I boot the node that was offline (from the ok prompt) and it has rejoined the cluster, all resources come online!

                        Regards,
                        Heinz
                        • 9. Re: SC 3.3: How to reinstall second node
                          807928
                          I don't really have many more suggestions. The only other thing I might try is to remove nodeA from the metaset and re-add it. You'd need to shut down nodeA to do the remove and then bring it back up to perform the add. However, this is really just a shot in the dark. If this doesn't fix it, then I'd advise you to log a service ticket and have it looked at by support.
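
                          Roughly (same caveats as the sketch in my first reply, and -f is needed because nodeA is down at that point; the same would apply to the kgr set), from nodeB with nodeA shut down:

                          # metaset -s sybase -d -m nodeA
                          # metaset -s sybase -d -f -h nodeA

                          and after booting nodeA back into the cluster:

                          # metaset -s sybase -a -h nodeA
                          # metaset -s sybase -a -m nodeA nodeB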

                          Regards,

                          Tim
                          ---
                          • 10. Re: SC 3.3: How to reinstall second node
                            924908
                            We opened a support case, and it seems that core patch 145333-11 is buggy, because with 145333-10 everything seems to work fine.
                            I'll keep you informed.
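
                            For anyone following along: checking the installed revision and backing the patch out would look roughly like this (please check the patch README first; cluster core patches normally have to be removed with the node booted in non-cluster mode, e.g. "boot -x"):

                            # showrev -p | grep 145333
                            # patchrm 145333-11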

                            Heinz