12 Replies Latest reply on Apr 6, 2009 10:00 PM by 807567

    how to determin cause for resource failure when added

    807567
      I have a 2 node cluster that I am trying to add an oracle 10g database to. I have another cluster which is running 3 different instances and I have had no issues with them. However on this new cluster, I can not add an the oracle_server resource to monitor/manage the database itself. I have the diskgroups, VIP, and listener added, but when I try to add the database resource, it fails and the logs are not helpful in determining the cause. Can anyone advise on how to debug this or what other logs I can check. Here are the errors I get when trying to add the resource.

      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <bin/oracle_server_validate> for resource <axioprddb>, resource group <axioprd>, node <wwpxpsdb01>, timeout <120> seconds
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 616562 daemon.notice] 9 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWscor/oracle_server/bin/oracle_server_validate>:tag=<axioprd.axioprddb.2>: Calling security_clnt_connect(..., host=<wwpxpsdb01>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <bin/oracle_server_validate> completed successfully for resource <axioprddb>, resource group <axioprd>, node <wwpxpsdb01>, time used: 0% of timeout <120 seconds>
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group axioprd state on node wwpxpsdb01 change to RG_ON_PENDING_METHODS
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb01 change to R_PENDING_INIT
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group axioprd state on node wwpxpsdb02 change to RG_OFF_PENDING_METHODS
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb02 change to R_PENDING_INIT
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb02 change to R_INITING
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb01 change to R_INITING
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <bin/oracle_server_init> for resource <axioprddb>, resource group <axioprd>, node <wwpxpsdb01>, timeout <30> seconds
      Jan 24 22:19:42 wwpxpsdb01 Cluster.RGM.rgmd: [ID 686647 daemon.notice] 43 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWscor/oracle_server/bin/oracle_server_init>:tag=<axioprd.axioprddb.4>: Calling security_clnt_connect(..., host=<wwpxpsdb01>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
      Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <bin/oracle_server_init> completed successfully for resource <axioprddb>, resource group <axioprd>, node <wwpxpsdb01>, time used: 0% of timeout <30 seconds>
      Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb01 change to R_OFFLINE
      Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group axioprd state on node wwpxpsdb01 change to RG_ONLINE
      Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb02 change to R_OFFLINE
      Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group axioprd state on node wwpxpsdb02 change to RG_OFFLINE



      Here is the command I run to add the resource:
      clrs "create" -t "SUNW.oracle_server" -g "axioprd" -p "ORACLE_HOME=/axios/oracle/1020" -p "ORACLE_SID=axioprd" -p "connect_string=monitor/********" -p "Alert_log_file=/axios/oracle/admin/axioprd/bdump/alert_axioprd.log" -p "resource_dependencies=axioprddg" "axioprddb"


      Thanks in advance if anyone can provide any guidance.
        • 1. Re: how to determin cause for resource failure when added
          807567
          Here are the rest of the logs from the time of the addition to the failure, the SUN forum only allowed 5000 characters:


          Jan 24 22:19:43 wwpxpsdb01 Cluster.CCR: [ID 973933 daemon.notice] resource axioprddb added.
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 314356 daemon.notice] resource axioprddb enabled.
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group axioprd state on node wwpxpsdb01 change to RG_PENDING_ONLINE
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb01 change to R_STARTING
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <bin/oracle_server_start> for resource <axioprddb>, resource group <axioprd>, node <wwpxpsdb01>, timeout <600> seconds
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource axioprddb status on node wwpxpsdb01 change to R_FM_UNKNOWN
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource axioprddb status msg on node wwpxpsdb01 change to <Starting>
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 686647 daemon.notice] 43 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWscor/oracle_server/bin/oracle_server_start>:tag=<axioprd.axioprddb.0>: Calling security_clnt_connect(..., host=<wwpxpsdb01>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
          Jan 24 22:19:43 wwpxpsdb01 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource axioprddb status msg on node wwpxpsdb01 change to <>
          Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource axioprddb status on node wwpxpsdb01 change to R_FM_ONLINE
          Jan 24 22:19:44 wwpxpsdb01 SC[SUNWscor.oracle_server.start]:axioprd:axioprddb: [ID 690463 daemon.error] Cannot bring server online on this node.
          Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource axioprddb status on node wwpxpsdb01 change to R_FM_DEGRADED
          Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 938318 daemon.error] Method <bin/oracle_server_start> failed on resource <axioprddb> in resource group <axioprd> [exit code <1>, time used: 0% of timeout <600 seconds>]
          Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 443746 daemon.notice] resource axioprddb state on node wwpxpsdb01 change to R_START_FAILED
          Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 784560 daemon.notice] resource axioprddb status on node wwpxpsdb01 change to R_FM_FAULTED
          Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group axioprd state on node wwpxpsdb01 change to RG_PENDING_ON_STARTED
          Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 529407 daemon.notice] resource group axioprd state on node wwpxpsdb01 change to RG_ONLINE
          • 2. Re: how to determin cause for resource failure when added
            807567
            In the eventlog, these are the entries I see when trying to perform the actions:

            Sat Jan 24 22:19:44 2009
            class: EC_Cluster subclass: ESC_cluster_fm_r_status_change
            vendor: SUNW pub: rgm pid: 6148
            cluster_id: 0x486084BC cluster: oracle_wprd node: wwpxpsdb01
            ts_sec: 1232860784 ts_usec: 553249
            severity: 0 (Info) initiator: 3 (Agent)
            r_name: axioprddb
            rg_name: axioprd node_name: wwpxpsdb01
            old_status: R_FM_ONLINE
            new_status: R_FM_DEGRADED
            status_msg:

            Sat Jan 24 22:19:44 2009
            class: EC_Cluster subclass: ESC_cluster_r_method_completed
            vendor: SUNW pub: rgm pid: 6148
            cluster_id: 0x486084BC cluster: oracle_wprd node: wwpxpsdb01
            ts_sec: 1232860784 ts_usec: 557272
            severity: 0 (Info) initiator: 1 (System)
            r_name: axioprddb
            rg_name: axioprd method_name: START
            method_duration: 1305063486 (1s 305ms)
            method_timeout: 600
            method_path: /opt/SUNWscor/oracle_server/bin/oracle_server_start
            result_code: 1 (Failed)

            Sat Jan 24 22:19:44 2009
            class: EC_Cluster subclass: ESC_cluster_r_state
            vendor: SUNW pub: rgm pid: 6148
            cluster_id: 0x486084BC cluster: oracle_wprd node: wwpxpsdb01
            ts_sec: 1232860784 ts_usec: 557772
            severity: 3 (Critical) initiator: 1 (System)
            r_name: axioprddb
            rg_name: axioprd node_list: wwpxpsdb02,wwpxpsdb01
            state_list: OFFLINE,START_FAILED

            Sat Jan 24 22:19:44 2009
            class: EC_Cluster subclass: ESC_cluster_fm_r_status_change
            vendor: SUNW pub: rgm pid: 6148
            cluster_id: 0x486084BC cluster: oracle_wprd node: wwpxpsdb01
            ts_sec: 1232860784 ts_usec: 558496
            severity: 0 (Info) initiator: 3 (Agent)
            r_name: axioprddb
            rg_name: axioprd node_name: wwpxpsdb01
            old_status: R_FM_DEGRADED
            new_status: R_FM_FAULTED
            status_msg:
            • 3. Re: how to determin cause for resource failure when added
              651292
              Hi,
              this message tells you where the problem is, but not what it is:
              Jan 24 22:19:44 wwpxpsdb01 SC[SUNWscor.oracle_server.start]:axioprd:axioprddb: [ID 690463 daemon.error] Cannot bring server online on this node.
              Jan 24 22:19:44 wwpxpsdb01 Cluster.RGM.rgmd: [ID 938318 daemon.error] Method <bin/oracle_server_start> failed on resource <axioprddb> in resource group <axioprd> [exit code <1>, time used: 0% of timeout <600 seconds>]
              2 ways to proceed:
              - try to bring up Oracle manually using the same parameters that you used to configure it under cluster control.
              - have a look at the Oracle log files. There you should be able to find the root cause of the problem

              Regards
              hartmut
              • 4. Re: how to determin cause for resource failure when added
                807567
                Thank you for the reply Hartmut, and sorry for taking so long to respond. I can start the database manually with the same parameters that are passed to it from within solaris cluster. I took your advice and asked our DBA's to check the Oracle logs, and it appears that the database is opened, and then the next log entry 1 second later is that the database is being shutdown.

                Starting background process QMNC
                QMNC started with pid=29, OS id=25025
                Sat Feb 21 21:45:24 2009
                Completed: ALTER DATABASE OPEN
                Sat Feb 21 21:50:29 2009
                Shutting down instance: further logons disabled
                Sat Feb 21 21:50:30 2009
                Stopping background process QMNC
                Sat Feb 21 21:50:30 2009
                Stopping background process CJQ0
                Sat Feb 21 21:50:32 2009
                Stopping background process MMNL
                Sat Feb 21 21:50:33 2009
                Stopping background process MMON
                Sat Feb 21 21:50:34 2009
                Shutting down instance (immediate)


                Does anyone have any suggestions on why this may be happening, or any other way I can troubleshoot this? I have 4 other instance running on this cluster, all configured the exact same way, but this one database is having issues on both nodes. It doesn't seem like an oracle issue based on the logs, and I can disable the database resource and everything goes online. I only have one dependency for the oracle database, and that is the filesystems, which are definitely online.
                • 5. Re: how to determin cause for resource failure when added
                  807816
                  Are any of the processes dumping out diagnostics, i.e. trc files? May be the system is running out of some resource like swap trying to open this database. Are there any accompanying messages in /var/adm/messages?

                  Tim
                  ---
                  • 6. Re: how to determin cause for resource failure when added
                    807567
                    There is nothing helpful in the trace files and the log entries above are from the alert log, with nothing more after the shutdown log entries. I was unable to test this for quite some time, and just this past weekend we had a large maintenance window. This was the first instance I tried to start in the cluster, so it shouldn't be a memory/swap issue, this is a fairly small instance running on a fresh booted domain with 3 system boards (96Gb Mem, 12 CPU, and 96Gb Swap). If I enable the SNMP event MIB, will that provide any additional info, or will I just see the same entries found in the cluster binary event log? Any suggestions on a Dtrace script that may point something out?

                    In VCS you can make a resource non-critical, so that if the monitor fails, it wouldn't shut anything down. This would allow you to start a database manually and troubleshoot the resource without impacting anything. Can I do the same with Solaris Cluster, can I make this DB resource not shutdown and/or fail the rest of the resources in the group, if I enable it. Basically I'm hoping to enable the DB resource and see the SNMP events that are generated as a result of the monitors and such.

                    Would this be the property to modify?

                    Failover_mode: SOFT
                    Class: standard
                    Description: Modifies recovery actions taken when the resource fails.
                    Type: enum


                    Just noticed this as well, anyone think this may help, if increased?

                    Debug_level: 1
                    Class: extension
                    Description: Debug level for logging messages
                    Per-node: False
                    Type: int


                    As always, thanks in advance for any assistance or suggestions.
                    • 7. Re: how to determin cause for resource failure when added
                      807816
                      When you tried to start it up manually did you make sure that the Sun Cluster resources were disabled and the resource group offline. I just want to make sure that we aren't getting any interference from that.

                      In Solaris Cluster you can turn off the monitoring on a resource which then allows you to administer the underlying application without the fault monitor kicking in and trying to restart your application. Of course, this relies on the fact that the start method can actually start the resource in the first place. If it can't you won't be able to enable the resource when the RG is online.

                      So, you need to be able to get to the bottom of why, even with the SC RG offline, you cannot start Oracle. If this is correct, then shared memory configuration, is the first think that springs to mind.

                      Regards,

                      Tim
                      ---
                      • 8. Re: how to determin cause for resource failure when added
                        807567
                        I must have made a typo somewhere Tim, and didn't realize it. I can start the database manually with the same parameters that the cluster uses, if I said in any post that I couldn't, then it was a typo and I apologize. Currently I have the resource unmonitored, with the database up and running from a manual start. At this point, I am hoping I can make the resource not trigger a failover or stop the database if I enable the monitor and it fails, in hopes that I can have the monitor run, see it fail again, and debug it from there.
                        • 9. Re: how to determin cause for resource failure when added
                          807816
                          Did you try setting debug_level = 9 on the Oracle server resource, as per the manual? That should provide you with more info. If the Oracle database can start up outside the cluster on both nodes, then it would suggest that something about the dependency or the file system configuration, possibly in /etc/vfstab on both nodes, is wrong.

                          Regards,

                          Tim
                          ---
                          • 10. Re: how to determin cause for resource failure when added
                            807567
                            I have not changed the debug level yet, but that is what I was hoping would help. According the docs I read, the value can be 1 - 100, but you mention a value of 9. I'm guessing that if I provide an out of range value, it will just error out, but since this is a production instance and highly visible at that, can you please confirm. Also, can I safely change the value of failover_mode to log_only, and safely start monitoring the resource again. Basically my goal is to have the instance running manually as it is now, then enable the monitor which should fail, and log debug messages but not stop the database. Can I achieve this, and is this the process to follow without impact?

                            clrs set -p "Debug_level=9" -p "Failover_mode=LOG_ONLY" $RESOURCE
                            clrs enable $RESOURCE


                            I assume Debug_level is adjustable at any time, but am guessing maybe not with Failover_mode. Sorry for all the questions, I am trying to get authorization for a development cluster so I can actually test things like this and not have to ask so many questions, but thank you for the help thus far. I have reviewed many docs and read the r_properties manpage, I am just hoping someone can confirm my understanding of the same or suggest the proper way to go about it.
                            • 11. Re: how to determin cause for resource failure when added
                              807816
                              If this is a production system you should really open a case with Sun Service to track this issue and get more timely support. This forum is not suppose to replace that formal support channel.

                              I had not looked at the docs in a while and debug_level=9 was what I remembered, but you are correct it goes up to 100.

                              From my reading of the docs, setting failover_mode to log_only should achieve what you want but I can't guarantee it. If the start method wasn't completing before when running this instance under SC, then it is unlikely this will succeed since the start method may not complete. It might be possible to start the DB outside the framework and then change this setting and enable the resource. I think that way the fault monitor will just try and probe the DB and report any errors without trying to restart it.

                              As I say, this is my understanding. I don't have the time to try this out on a test cluster as I need to get on with development work!

                              I hope that's of some help,

                              Tim
                              ---
                              • 12. Re: how to determin cause for resource failure when added
                                807567
                                Just in case anyone else ever has this same issue, this past weekend the database was upgraded from 10.2.0.1 to 10.2.0.4, and after the upgrade, everything was working fine. Nothing was modified in SC and on the database side no configuration was changed, all env settings and path locations remained the same. Maybe it was just coincidence, but as strange as it may sound, it seems like the issue may be some sort of bug. If you feel like you have tried everything to no avail like I did, check the oracle version and if 10.2.0.1, you may be seeing the same problem that I was.