This discussion is archived
7 Replies Latest reply: Feb 11, 2013 12:03 AM by HartmutStreppel

switching resource group in 2 node cluster fails

903538 Newbie
Hi,
I configured a 2-node cluster to provide high availability for my Oracle DB 9.2.0.7.
I created a resource group named oracleha-rg,
and later created the following resources (sketched below):
oraclelh-rs for the logical hostname
hastp-rs for the HA storage resource
oracle-server-rs for the Oracle server resource
and listener-rs for the listener
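Roughly, they were created along these lines (not my exact commands; the logical hostname oraclelh and the listener name are placeholders, and the resource types have to be registered with clresourcetype first):

Create the failover resource group:
# clresourcegroup create oracleha-rg
Add the logical hostname resource:
# clreslogicalhostname create -g oracleha-rg -h oraclelh oraclelh-rs
Add the HA storage resource for the shared data file system:
# clresource create -g oracleha-rg -t SUNW.HAStoragePlus -p FilesystemMountPoints=/data1 hastp-rs
Add the Oracle server and listener resources, depending on the storage resource:
# clresource create -g oracleha-rg -t SUNW.oracle_server -p ORACLE_SID=RWDB -p ORACLE_HOME=/u01/app/oracle/product/9.2.0.7 -p Resource_dependencies=hastp-rs oracle-server-rs
# clresource create -g oracleha-rg -t SUNW.oracle_listener -p ORACLE_HOME=/u01/app/oracle/product/9.2.0.7 -p LISTENER_NAME=LISTENER -p Resource_dependencies=hastp-rs listener-rs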

Whenever I try to switch the resource group between nodes, dmesg shows the following:

Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hafoip_stop> for resource <oraclelh-rs>, resource group <oracleha-rg>, node <DB1>, timeout <300> seconds
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource oraclelh-rs status on node DB1 change to R_FM_UNKNOWN
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource oraclelh-rs status msg on node DB1 change to <Stopping>
Feb 6 16:17:49 DB1 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 010.050.033.009:0, remote = 000.000.000.000:0, start = -2, end = 6
Feb 6 16:17:49 DB1 ip: [ID 302654 kern.notice] TCP_IOC_ABORT_CONN: aborted 0 connection
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource oraclelh-rs status on node DB1 change to R_FM_OFFLINE
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource oraclelh-rs status msg on node DB1 change to <LogicalHostname offline.>
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hafoip_stop> completed successfully for resource <oraclelh-rs>, resource group <oracleha-rg>, node <DB1>, time used: 0% of timeout <300 seconds>
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource oraclelh-rs state on node DB1 change to R_OFFLINE
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_postnet_stop> for resource <hastp-rs>, resource group <oracleha-rg>, node <DB1>, timeout <1800> seconds
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource hastp-rs status on node DB1 change to R_FM_UNKNOWN
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource hastp-rs status msg on node DB1 change to <Stopping>
Feb 6 16:17:49 DB1 SC[,SUNW.HAStoragePlus:8,oracleha-rg,hastp-rs,hastorageplus_postnet_stop]: [ID 843127 daemon.warning] Extension properties FilesystemMountPoints and GlobalDevicePaths and Zpools are empty.
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <hastorageplus_postnet_stop> completed successfully for resource <hastp-rs>, resource group <oracleha-rg>, node <DB1>, time used: 0% of timeout <1800 seconds>
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 443746 daemon.notice] resource hastp-rs state on node DB1 change to R_OFFLINE
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 784560 daemon.notice] resource hastp-rs status on node DB1 change to R_FM_OFFLINE
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 922363 daemon.notice] resource hastp-rs status msg on node DB1 change to
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.error] resource group oracleha-rg state on node DB1 change to RG_OFFLINE_START_FAILED
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB1 change to RG_OFFLINE
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not attempting to start resource group <oracleha-rg> on node <DB1> because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 447451 daemon.notice] Not attempting to start resource group <oracleha-rg> on node <DB2> because this resource group has already failed to start on this node 2 or more times in the past 3600 seconds
Feb 6 16:17:49 DB1 Cluster.RGM.global.rgmd: [ID 674214 daemon.notice] rebalance: no primary node is currently found for resource group <oracleha-rg>.
Feb 6 16:19:08 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource hastp-rs disabled.
Feb 6 16:19:17 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource oraclelh-rs disabled.
Feb 6 16:19:22 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource oracle-rs disabled.
Feb 6 16:19:27 DB1 Cluster.RGM.global.rgmd: [ID 603096 daemon.notice] resource listener-rs disabled.
Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB1 change to RG_OFF_PENDING_METHODS
Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB2 change to RG_OFF_PENDING_METHODS
Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 224900 daemon.notice] launching method <bin/oracle_listener_fini> for resource <listener-rs>, resource group <oracleha-rg>, node <DB1>, timeout <30> seconds
Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 515159 daemon.notice] method <bin/oracle_listener_fini> completed successfully for resource <listener-rs>, resource group <oracleha-rg>, node <DB1>, time used: 0% of timeout <30 seconds>
Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB1 change to RG_OFFLINE
Feb 6 16:19:51 DB1 Cluster.RGM.global.rgmd: [ID 529407 daemon.notice] resource group oracleha-rg state on node DB2 change to RG_OFFLINE
and the resource group fails to switch...
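The switch and the status checks I am running look roughly like this (DB2 as an example target node):

Switch the group to the other node:
# clresourcegroup switch -n DB2 oracleha-rg
Check the group and resource states afterwards:
# clresourcegroup status oracleha-rg
# clresource status -g oracleha-rg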
any help please?
  • 1. Re: switching resource group in 2 node cluster fails
    Nik Expert
    Hi.
    Feb 6 16:17:49 DB1 SC[,SUNW.HAStoragePlus:8,oracleha-rg,hastp-rs,hastorageplus_postnet_stop]: [ID 843127 daemon.warning] Extension properties FilesystemMountPoints and GlobalDevicePaths and Zpools are empty.

    It looks like an incorrectly configured HAStoragePlus resource.

    What type of FS do you use?

    Show the configuration of the hastp-rs resource (for example, with the commands sketched below).

    What FS and mount points are used by Oracle?
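    For example (adjust the paths as needed):

    Show the resource configuration including the extension properties:
    # clresource show -v hastp-rs
    Show what is currently mounted at the Oracle paths:
    # df -k /data1 /u01
    Show the vfstab entries without comments:
    # grep -v '^#' /etc/vfstab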


    Regards.
  • 2. Re: switching resource group in 2 node cluster fails
    903538 Newbie
    Hi, thanks for the reply.
    I am using Solaris 10, UFS and SVM.

    I have created hastp-rs with the following command:
    # clrs create -g oracleha-rg -t oracle_server -p Resource_dependencies=hastp-rs -p Connect_string=rwmonitor/rwmonitor -p ORACLE_SID=RWDB -p ORACLE_HOME=/u01/app/oracle/product/9.2.0.7 -p Alert_log_file=/u01/app/oracle/admin/RWDB/bdump/alert_RWDB.log oracle-rs

    The Oracle data file system is mounted on shared storage on /data1,
    and the Oracle configuration files are under /u01/app/oracle (local disks).

    Best regards
  • 3. Re: switching resource group in 2 node cluster fails
    HartmutStreppel Explorer
    This was your oracle_server resource, not the HASP one.
  • 4. Re: switching resource group in 2 node cluster fails
    903538 Newbie
    oh sorry :)
    here they are:

    extension properties:
    1) Zpools <NULL>
    2) FilesystemCheckCommand <NULL>
    3) FilesystemMountPoints /data1
    standard properties:
    Property Name Current Setting
    ============= ===============

    1) R_description Failover data service resource
    2) Resource_dependencies <NULL>
    3) Resource_dependencies_weak <NULL>
    4) Resource_dependencies_restart <NULL>
    5) Resource_dependencies_offline_restart <NULL>
    6) Retry_interval 300
    7) Retry_count 2
    8) Failover_mode SOFT
    9) POSTNET_STOP_TIMEOUT 1800
    10) PRENET_START_TIMEOUT 1800

    n) Next >
    q) Done

    Option: n

    Select the standard property you want to change:

    Property Name Current Setting
    ============= ===============

    11) MONITOR_CHECK_TIMEOUT 90
    12) MONITOR_STOP_TIMEOUT 90
    13) MONITOR_START_TIMEOUT 90
    14) INIT_TIMEOUT 1800
    15) UPDATE_TIMEOUT 1800
    16) VALIDATE_TIMEOUT 1800

    and

    root@DB1 # clrs show hastp-rs

    === Resources ===

    Resource: hastp-rs
    Type: SUNW.HAStoragePlus:8
    Type_version: 8
    Group: oracleha-rg
    R_description: Failover data service resource for SUNW.HAStoragePlus:8
    Resource_project_name: default
    Enabled{DB1}: True
    Enabled{DB2}: True
    Monitored{DB1}: True
    Monitored{DB2}: True

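    For completeness: the warning in my first log said FilesystemMountPoints was empty, while the output above shows /data1. If it were still empty, I understand it could be checked and set roughly like this (the property may only be changeable while the resource is disabled, or by recreating the resource):

    Check the current value:
    # clresource show -p FilesystemMountPoints hastp-rs
    Set it to the shared data mount point:
    # clresource set -p FilesystemMountPoints=/data1 hastp-rs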
  • 5. Re: switching resource group in 2 node cluster fails
    Nik Expert
    Hi.

    According to the provided config, the HASP resource monitors mount point /data1, but according to the oracle resource, mount point /u01 is used.

    Please show the list of used FS:

    What FS is used for the Oracle data (mount point list)?
    What FS is used for the Oracle binaries (mount point list)?

    Do you place the Oracle binaries on shared disk, or locally on every node?

    Please show /etc/vfstab from every node.


    Regards.
  • 6. Re: switching resource group in 2 node cluster fails
    903538 Newbie
    Hi,
    I use /u01 for the Oracle binaries, which is located on local disks on each node; the Oracle data files will be written on the shared disks.
    Both file systems are UFS, as per the vfstab below:

    root@DB1 # more /etc/vfstab
    #device device mount FS fsck mount mount
    #to mount to fsck point type pass at boot options
    #
    fd      -       /dev/fd fd      -       no      -
    /proc - /proc proc - no -
    /dev/md/dsk/d1 - - swap - no -
    /dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no -
    /dev/md/dsk/d4 /dev/md/rdsk/d4 /var ufs 1 no -
    #/dev/md/dsk/d6 /dev/md/rdsk/d6 /globaldevices ufs 2 yes -
    /dev/md/dsk/d5 /dev/md/rdsk/d5 /opt/Roamware ufs 2 yes -
    /dev/md/dsk/d3 /dev/md/rdsk/d3 /u01 ufs 2 yes -
    /devices - /devices devfs - no -
    sharefs -       /etc/dfs/sharetab       sharefs -       no      -
    ctfs    -       /system/contract        ctfs    -       no      -
    objfs   -       /system/object  objfs   -       no      -
    swap    -       /tmp    tmpfs   -       yes     -
    /dev/md/dsk/d6 /dev/md/rdsk/d6 /global/.devices/node@1 ufs 2 no global
    /dev/md/oradbset/dsk/d140 /dev/md/oradbset/rdsk/d140 /data1 ufs 3 no -

    root@DB2 # cat /etc/vfstab
    #device device mount FS fsck mount mount
    #to mount to fsck point type pass at boot options
    #
    fd      -       /dev/fd fd      -       no      -
    /proc - /proc proc - no -
    /dev/md/dsk/d1 - - swap - no -
    /dev/md/dsk/d0 /dev/md/rdsk/d0 / ufs 1 no -
    /dev/md/dsk/d4 /dev/md/rdsk/d4 /var ufs 1 no -
    #/dev/md/dsk/d9 /dev/md/rdsk/d9 /globaldevices ufs 2 yes -
    /dev/md/dsk/d5 /dev/md/rdsk/d5 /opt/Roamware ufs 2 yes -
    /dev/md/dsk/d3 /dev/md/rdsk/d3 /u01 ufs 2 yes -
    /devices - /devices devfs - no -
    sharefs -       /etc/dfs/sharetab       sharefs -       no      -
    ctfs    -       /system/contract        ctfs    -       no      -
    objfs   -       /system/object  objfs   -       no      -
    swap    -       /tmp    tmpfs   -       yes     -
    /dev/md/dsk/d9 /dev/md/rdsk/d9 /global/.devices/node@2 ufs 2 no global
    /dev/md/oradbset/dsk/d400 /dev/md/oradbset/rdsk/d400 /data1 ufs 2 no -


    Could it be a database issue?
    I mean, the Oracle DB can start up normally on DB1, but when I run the database startup on DB2, the database gives errors on the control files and never gets mounted.
    If that is the reason for the resource group switch failure, how would the cluster know that the database is having an error?
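    As far as I understand, the SUNW.oracle_server start method and fault monitor connect to the instance (using Connect_string) and watch the alert log, so a database that cannot mount would make the start fail. To test this outside the cluster (with the resource group offline on both nodes), I guess I could move the device group to DB2 and try a manual startup, roughly:

    Move the device group and mount the data file system on DB2:
    # cldevicegroup switch -n DB2 oradbset
    # mount /data1
    Try a manual startup as the oracle user:
    # su - oracle
    $ sqlplus / as sysdba
    SQL> startup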
  • 7. Re: switching resource group in 2 node cluster fails
    HartmutStreppel Explorer
    I must admit that I haven't worked with SVM for a while. So, what puzzles me is this:
    DB1: /dev/md/oradbset/dsk/d140 /dev/md/oradbset/rdsk/d140 /data1 ufs 3 no -
    DB2: /dev/md/oradbset/dsk/d400 /dev/md/oradbset/rdsk/d400 /data1 ufs 2 no -

    Is this metaset on shared storage? Don't you have to use Solaris Cluster DID devices to create it (rough sketch below)? You should really check the docs on how to use SVM with Solaris Cluster on shared devices.
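    If I remember correctly, a shared diskset would be created on DID devices roughly like this (d10 is just a placeholder, check the output of cldevice list for the real IDs):

    Map DID devices to their physical paths:
    # cldevice list -v
    Create the diskset with both nodes as hosts and add a shared DID device:
    # metaset -s oradbset -a -h DB1 DB2
    # metaset -s oradbset -a /dev/did/rdsk/d10

    That way both nodes would see the same metadevices under /dev/md/oradbset, which does not fit with DB1 mounting d140 and DB2 mounting d400 for the same /data1.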

    In your initial post there was one additional thing: your failover "testing" fell into the pingpong trap. You should check Pingpong_interval in the docs. This mechanism prevents resource groups from bouncing between cluster nodes without a chance to get online anywhere (see the sketch below).
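    Roughly (600 seconds is only an example value):

    Check the current setting on the resource group:
    # clresourcegroup show -p Pingpong_interval oracleha-rg
    Lower it, or simply wait, so that the RGM will attempt to start the group again:
    # clresourcegroup set -p Pingpong_interval=600 oracleha-rg
    Then bring the group online on one node:
    # clresourcegroup online -n DB2 oracleha-rg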

    Regards
