7 Replies Latest reply: Feb 11, 2013 2:03 AM by HartmutStreppel

    switching resource group in 2 node cluster fails

    903538
      Hi,
      I configured a 2-node cluster to provide high availability for my Oracle DB 9.2.0.7.
      I created a resource group and named it oracleha-rg,
      and I later created the following resources:
      oraclelh-rs for the logical hostname,
      hastp-rs for the HA storage resource,
      oracle-server-rs for the Oracle server,
      and listener-rs for the listener.

      Whenever I try to switch the resource group between nodes, dmesg gives me the following:

      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hafoip_stop> for resource <oraclelh-rs>, resource group <oracleha-rg>, node <DB1>, timeout <300> seconds
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource oraclelh-rs status on node DB1 change to R_FM_UNKNOWN
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource oraclelh-rs status msg on node DB1 change to <Stopping>
      Feb 6 16:17:49 DB1 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 010.050.033.009:0, remote = 000.000.000.000:0, start = -2, end = 6
      Feb 6 16:17:49 DB1 ip: [ID 302654 kern.notice] TCP_IOC_ABORT_CONN: aborted 0 connection
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource oraclelh-rs status on node DB1 change to R_FM_OFFLINE
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource oraclelh-rs status msg on node DB1 change to <LogicalHostname offline.>
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hafoip_stop> completed successfully for resource <oraclelh-rs>, resource group <oracleha-rg>, node <DB1>, time used: 0% of timeout <300 seconds>
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource oraclelh-rs state on node DB1 change to R_OFFLINE
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_postnet_stop> for resource <hastp-rs>, resource group <oracleha-rg>, node <DB1>, timeout <1800> seconds
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource hastp-rs status on node DB1 change to R_FM_UNKNOWN
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource hastp-rs status msg on node DB1 change to <Stopping>
      Feb 6 16:17:49 DB1 SC[,SUNW.HAStoragePlus:8,oracleha-rg,hastp-rs,hastorageplus_postnet_stop]: [ID 843127 daemon.warning] Extension properties FilesystemMountPoints and GlobalDevicePaths and Zpools are empty.
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hastorageplus_postnet_stop> completed successfully for resource <hastp-rs>, resource group <oracleha-rg>, node <DB1>, time used: 0% of timeout <1800 seconds>
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource hastp-rs state on node DB1 change to R_OFFLINE
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource hastp-rs status on node DB1 change to R_FM_OFFLINE
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource hastp-rs status msg on node DB1 change to
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.error] resource group oracleha-rg state on node DB1 change to RG_OFFLINE_START_FAILED
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB1 change to RG_OFFLINE
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not attempting to start resource group <oracleha-rg> on node <DB1> because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not attempting to start resource group <oracleha-rg> on node <DB2> because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds
      Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 674214 daemon.notice] rebalance: no primary node is currently found for resource group <oracleha-rg>.
      Feb 6 16:19:08 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource hastp-rs disabled.
      Feb 6 16:19:17 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource oraclelh-rs disabled.
      Feb 6 16:19:22 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource oracle-rs disabled.
      Feb 6 16:19:27 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource listener-rs disabled.
      Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB1 change to RG_OFF_PENDING_METHODS
      Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB2 change to RG_OFF_PENDING_METHODS
      Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <bin/oracle_listener_fini> for resource <listener-rs>, resource group <oracleha-rg>, node <DB1>, timeout <30> seconds
      Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <bin/oracle_listener_fini> completed successfully for resource <listener-rs>, resource group <oracleha-rg>, node <DB1>, time used: 0% of timeout <30 seconds>
      Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB1 change to RG_OFFLINE
      Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB2 change to RG_OFFLINE
      and the resource group fails to switch over.
      Any help, please?
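      (For anyone debugging a similar failover, the resource-group and resource states can be inspected with the standard Solaris Cluster CLI; this is a sketch using the group and node names from this thread, not output from the poster's cluster:)

```shell
# Show where the resource group is online, or why it is nowhere online
clrg status oracleha-rg

# Show the state of every resource inside the group
clrs status -g oracleha-rg

# Retry the switchover onto a specific node
clrg online -n DB2 oracleha-rg
```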
        • 1. Re: switching resource group in 2 node cluster fails
          Nik
          Hi.
          Feb 6 16:17:49 DB1 SC[,SUNW.HAStoragePlus:8,oracleha-rg,hastp-rs,hastorageplus_postnet_stop]: [ID 843127 daemon.warning] Extension properties FilesystemMountPoints and GlobalDevicePaths and Zpools are empty.

          It looks like an incorrectly configured HAStoragePlus resource.

          What type of FS do you use?

          Please show the configuration of the hastp-rs resource.

          Which filesystems and mount points are used by Oracle?


          Regards.
          • 2. Re: switching resource group in 2 node cluster fails
            903538
            Hi, thanks for the reply.
            I am using Solaris 10 with UFS and SVM.

            I created hastp-rs with the following command:
            # clrs create -g oracleha-rg -t oracle_server -p Resource_dependencies=hastp-rs -p Connect_string=rwmonitor/rwmonitor -p ORACLE_SID=RWDB -p ORACLE_HOME=/u01/app/oracle/product/9.2.0.7 -p Alert_log_file=/u01/app/oracle/admin/RWDB/bdump/alert_RWDB.log oracle-rs

            The Oracle data filesystem is mounted on shared storage at /data1,
            and the Oracle configuration files are under /u01/app/oracle (local disks).

            Best regards
            • 3. Re: switching resource group in 2 node cluster fails
              HartmutStreppel
              This was your oracle_server resource, not the HAStoragePlus one.
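              (For comparison, a minimal HAStoragePlus resource for a failover UFS mount such as /data1 would be created roughly like this; a sketch only, using the group, mount point, and resource names from this thread:)

```shell
# Register the resource type once per cluster, if not already done
clrt register SUNW.HAStoragePlus

# Create the storage resource that mounts /data1 on whichever node
# currently hosts the resource group
clrs create -g oracleha-rg -t SUNW.HAStoragePlus \
    -p FilesystemMountPoints=/data1 \
    hastp-rs
```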
              • 4. Re: switching resource group in 2 node cluster fails
                903538
                oh sorry :)
                here they are:

                extension properties:
                1) Zpools <NULL>
                2) FilesystemCheckCommand <NULL>
                3) FilesystemMountPoints /data1
                standard properties:
                Property Name Current Setting
                ============= ===============

                1) R_description Failover data service resource
                2) Resource_dependencies <NULL>
                3) Resource_dependencies_weak <NULL>
                4) Resource_dependencies_restart <NULL>
                5) Resource_dependencies_offline_restart <NULL>
                6) Retry_interval 300
                7) Retry_count 2
                8) Failover_mode SOFT
                9) POSTNET_STOP_TIMEOUT 1800
                10) PRENET_START_TIMEOUT 1800

                11) MONITOR_CHECK_TIMEOUT 90
                12) MONITOR_STOP_TIMEOUT 90
                13) MONITOR_START_TIMEOUT 90
                14) INIT_TIMEOUT 1800
                15) UPDATE_TIMEOUT 1800
                16) VALIDATE_TIMEOUT 1800

                and

                root@DB1 # clrs show hastp-rs

                === Resources ===

                Resource: hastp-rs
                Type: SUNW.HAStoragePlus:8
                Type_version: 8
                Group: oracleha-rg
                R_description: Failover data service resource for SUNW.HAStoragePlus:8
                Resource_project_name: default
                Enabled{DB1}: True
                Enabled{DB2}: True
                Monitored{DB1}: True
                Monitored{DB2}: True

                Edited by: 900535 on Feb 7, 2013 4:42 AM
                • 5. Re: switching resource group in 2 node cluster fails
                  Nik
                  Hi.

                  According to the provided config, HASP monitors the mount point /data1, but according to the oracle resource, the mount point /u01 is used.

                  Please show the list of filesystems in use:

                  Which FS is used for the Oracle data (mount point list)?
                  Which FS is used for the Oracle binaries (mount point list)?

                  Do you place the Oracle binaries on shared disk, or locally on every node?

                  Please show /etc/vfstab from every node.


                  Regards.
                  • 6. Re: switching resource group in 2 node cluster fails
                    903538
                    Hi,
                    I use /u01 for the Oracle binaries, which are located on local disks on each node; the Oracle data files are written to the shared disks.
                    Both filesystems are UFS, as per the vfstab below:

                    root@DB1 # more /etc/vfstab
                    #device device mount FS fsck mount mount
                    #to mount to fsck point type pass at boot options
                    #
                    fd      -       /dev/fd fd      -       no      -
                    /proc - /proc proc - no -
                    /dev/md/dsk/d1 - - swap - no -
                    /dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no -
                    /dev/md/dsk/d4 /dev/md/rdsk/d4 /var ufs 1 no -
                    #/dev/md/dsk/d6 /dev/md/rdsk/d6 /globaldevices ufs 2 yes -
                    /dev/md/dsk/d5 /dev/md/rdsk/d5 /opt/Roamware ufs 2 yes -
                    /dev/md/dsk/d3 /dev/md/rdsk/d3 /u01 ufs 2 yes -
                    /devices - /devices devfs - no -
                    sharefs -       /etc/dfs/sharetab       sharefs -       no      -
                    ctfs    -       /system/contract        ctfs    -       no      -
                    objfs   -       /system/object  objfs   -       no      -
                    swap    -       /tmp    tmpfs   -       yes     -
                    /dev/md/dsk/d6 /dev/md/rdsk/d6 /global/.devices/node@1 ufs 2 no global
                    /dev/md/oradbset/dsk/d140 /dev/md/oradbset/rdsk/d140 /data1 ufs 3 no -

                    root@DB2 # cat /etc/vfstab
                    #device device mount FS fsck mount mount
                    #to mount to fsck point type pass at boot options
                    #
                    fd      -       /dev/fd fd      -       no      -
                    /proc - /proc proc - no -
                    /dev/md/dsk/d1 - - swap - no -
                    /dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no -
                    /dev/md/dsk/d4 /dev/md/rdsk/d4 /var ufs 1 no -
                    #/dev/md/dsk/d9 /dev/md/rdsk/d9 /globaldevices ufs 2 yes -
                    /dev/md/dsk/d5 /dev/md/rdsk/d5 /opt/Roamware ufs 2 yes -
                    /dev/md/dsk/d3 /dev/md/rdsk/d3 /u01 ufs 2 yes -
                    /devices - /devices devfs - no -
                    sharefs -       /etc/dfs/sharetab       sharefs -       no      -
                    ctfs    -       /system/contract        ctfs    -       no      -
                    objfs   -       /system/object  objfs   -       no      -
                    swap    -       /tmp    tmpfs   -       yes     -
                    /dev/md/dsk/d9 /dev/md/rdsk/d9 /global/.devices/node@2 ufs 2 no global
                    /dev/md/oradbset/dsk/d400 /dev/md/oradbset/rdsk/d400 /data1 ufs 2 no -


                    Could it be a database issue?
                    I mean, the Oracle DB can start up normally on DB1, but when I run database startup on DB2, the database gives errors on the control files and never gets mounted.
                    If that is the reason the resource group fails to switch, how could the cluster know that the database is having an error?
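                    (One way to separate a database problem from a cluster problem is to start the instance by hand on the second node, outside cluster control. A sketch, assuming the SVM device group and the /data1 mount are moved over first; `cldg` is the short form of `cldevicegroup`:)

```shell
# Move the SVM device group that hosts /data1 over to DB2
cldg switch -n DB2 oradbset

# Mount the data filesystem manually on DB2
mount /data1

# Try to start the instance by hand as the oracle user
su - oracle -c 'sqlplus "/ as sysdba" <<EOF
startup
EOF'
```

                    If the manual startup on DB2 fails with the same control-file errors, the fault is in the database/storage layer rather than in the cluster framework; the HA-Oracle fault monitor can only report what it observes when probing the instance.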
                    • 7. Re: switching resource group in 2 node cluster fails
                      HartmutStreppel
                      I must admit that I haven't worked with SVM for a while. So, what puzzles me is this:
                      DB1: /dev/md/oradbset/dsk/d140 /dev/md/oradbset/rdsk/d140 /data1 ufs 3 no -
                      DB2: /dev/md/oradbset/dsk/d400 /dev/md/oradbset/rdsk/d400 /data1 ufs 2 no -

                      Is this metaset on shared storage? Don't you have to use Solaris Cluster DID devices to create them? You should really check the docs on how to use SVM with Solaris Cluster on shared devices.

                      In your initial post there was one additional thing: your failover "testing" fell into the ping-pong trap. You should check for Pingpong_interval in the docs. This mechanism prevents resource groups from bouncing between cluster nodes without a chance to come online anywhere.

                      Regards
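                      (Following up on both points, a rough sketch of the corresponding checks; command forms as in the Sun Cluster CLI, with the set and group names taken from this thread:)

```shell
# List cluster-wide DID devices; a shared metaset should be built on
# /dev/did/rdsk/dNs2 paths rather than per-node c#t#d# names
cldevice list -v

# Inspect the oradbset metaset and see which node currently owns it
metaset -s oradbset

# Optionally shorten the ping-pong protection window while testing
# (value in seconds; the log above shows the 3600-second window)
clrg set -p Pingpong_interval=600 oracleha-rg
```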