3 Replies Latest reply: Apr 21, 2013 12:20 AM by KR10822864

    Oracle cluster RESTART_COUNT growing

    TasslehoffBurrfoot
      Hi all, I have a problem with an Oracle 10 instance configured as an active/passive cluster with CRS; the storage is configured with ASM.
      Please be patient, as I'm a simple sysadmin and have almost no experience with Oracle databases, especially with CRS and ASM configurations.

      The problem causes the DBMS to relocate from the active node to the passive node when the CRS RESTART_COUNT parameter reaches the RESTART_ATTEMPTS parameter value.
      The two databases hosted on this Oracle instance are really stable and work very well; we are looking for the cause of this problem to prevent unmanaged relocations and interruptions.

      Having no experience with Oracle CRS, we monitor these instances with a simple bash script that launches crs_stat -v, parses the results, and sends an alert when a safety threshold is reached; after that we schedule a manual relocate to reset the RESTART_COUNT parameter.
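The monitoring approach described above could be sketched roughly as follows. This is a minimal illustration in Python rather than the poster's actual bash script. It assumes `crs_stat -v` prints one `NAME=VALUE` pair per line with a blank line between resources; the sample values, the `PORTAL.db` resource name, and the `margin` parameter are hypothetical.

```python
import subprocess

def parse_crs_stat(text):
    """Split crs_stat -v style output into one dict per resource.

    Assumes NAME=VALUE lines, with blank lines separating resources.
    """
    resources = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                resources.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if sep:
            current[key] = value
    if current:
        resources.append(current)
    return resources

def near_limit(resources, margin=10):
    """Return names of resources whose RESTART_COUNT is within
    `margin` restarts of RESTART_ATTEMPTS (margin is a hypothetical knob)."""
    flagged = []
    for res in resources:
        try:
            count = int(res["RESTART_COUNT"])
            limit = int(res["RESTART_ATTEMPTS"])
        except (KeyError, ValueError):
            continue  # resource block without usable counters
        if limit - count <= margin:
            flagged.append(res.get("NAME", "?"))
    return flagged

def read_crs_stat():
    """Run crs_stat -v on a cluster node and return its output as text."""
    return subprocess.run(["crs_stat", "-v"],
                          capture_output=True, text=True).stdout

# Illustrative input in the assumed layout; PORTAL.db is a made-up resource.
SAMPLE = """\
NAME=PORTAL.listener
TYPE=application
RESTART_ATTEMPTS=100
RESTART_COUNT=37
TARGET=ONLINE
STATE=ONLINE on orclsrv16

NAME=PORTAL.db
TYPE=application
RESTART_ATTEMPTS=100
RESTART_COUNT=95
TARGET=ONLINE
STATE=ONLINE on orclsrv16
"""

if __name__ == "__main__":
    # On a real node, pass read_crs_stat() instead of SAMPLE.
    print(near_limit(parse_crs_stat(SAMPLE)))
```

Comparing against RESTART_ATTEMPTS per resource, instead of a fixed number, keeps the alert meaningful if resources are registered with different limits.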

      This problem does not occur frequently, but it's quite annoying; we are looking for any log that can give us a clue to the real cause and a real solution.

      Any suggestion on some logs?
      I've already checked alert.log and the listener log but I found nothing useful, is there any CRS log?

      The Oracle instances (10.2.0.3.0 Enterprise) are installed on RedHat Linux ES 4.0 x64.

      Thanks for any info.

      Tas
        • 1. Re: Oracle cluster RESTART_COUNT growing
          onedbguru
          Go to the ORACLE_HOME for the CRS and search for all of the logs; you are looking for ohasd.log. You can ignore the install logs; most of the logs have 4-5 character names: crsd.log, ohasd.log, ctssd.log, etc.

          I am guessing you are using active/passive for cost savings (ie not using RAC).
          • 2. Re: Oracle cluster RESTART_COUNT growing
            TasslehoffBurrfoot
            onedbguru wrote:
            Go to the ORACLE_HOME for the CRS and search for all of the logs; you are looking for ohasd.log. You can ignore the install logs; most of the logs have 4-5 character names: crsd.log, ohasd.log, ctssd.log, etc.
            First of all, thanks for the reply :)

            I found only crsd.log, at /u01/crs/oracle/product/10.2.0/crs/log/orclsrv16/crsd/crsd.log (orclsrv16 is the server name).
            In that log I found records written at every RESTART_COUNT increase; here is the last occurrence of the problem:
            -----
            2013-04-16 12:22:23.189: [  CRSAPP][1482860896]0CheckResource error for PORTAL.listener error code = 1
            2013-04-16 12:22:23.194: [  CRSRES][1482860896]0In stateChanged, PORTAL.listener target is ONLINE
            2013-04-16 12:22:23.194: [  CRSRES][1482860896]0PORTAL.listener on orclsrv16 went OFFLINE unexpectedly
            2013-04-16 12:22:23.194: [  CRSRES][1482860896]0StopResource: setting CLI values
            2013-04-16 12:22:23.211: [  CRSRES][1482860896]0Attempting to stop `PORTAL.listener` on member `orclsrv16`
            2013-04-16 12:22:32.740: [  CRSRES][1482860896]0Stop of `PORTAL.listener` on member `orclsrv16` succeeded.
            2013-04-16 12:22:32.740: [  CRSRES][1482860896]0PORTAL.listener RESTART_COUNT=36 RESTART_ATTEMPTS=100
            2013-04-16 12:22:32.740: [  CRSRES][1482860896]0PORTAL.listener Uptime does not exceed uptime_threshold
            2013-04-16 12:22:32.741: [  CRSRES][1482860896]0Restarting PORTAL.listener on orclsrv16
            2013-04-16 12:22:32.757: [  CRSRES][1482860896]0startRunnable: setting CLI values
            2013-04-16 12:22:32.757: [  CRSRES][1482860896]0Attempting to start `PORTAL.listener` on member `orclsrv16`
            2013-04-16 12:22:32.953: [  CRSRES][1482860896]0Start of `PORTAL.listener` on member `orclsrv16` succeeded.
            2013-04-16 12:22:32.955: [  CRSRES][1482860896]0Successfully restarted PORTAL.listener on orclsrv16, RESTART_COUNT=37
            2013-04-16 12:22:32.991: [  CRSRES][1482860896]0PORTAL.listener Updated LAST_RESTART time in ocr
            -----
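To find every occurrence without reading the whole file, the excerpt above could be scanned with a small script. The following is only a sketch: the regular expressions are written against the lines quoted in this post, and other crsd.log versions or messages may be formatted differently.

```python
import re

# Timestamp, [module], [thread id], an optional single digit, then the message.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*(?P<module>\w+)\]\[(?P<thread>\d+)\]\d?(?P<msg>.*)$"
)
CHECK_RE = re.compile(
    r"CheckResource error for (?P<res>\S+) error code = (?P<code>\d+)"
)
RESTART_RE = re.compile(
    r"Successfully restarted (?P<res>\S+) on (?P<node>[^\s,]+), "
    r"RESTART_COUNT=(?P<count>\d+)"
)

def restart_events(lines):
    """Return (timestamp, kind, resource, value) tuples for check errors
    and successful restarts found in crsd.log style lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        msg = m.group("msg")
        check = CHECK_RE.search(msg)
        if check:
            events.append((m.group("ts"), "check_error",
                           check.group("res"), int(check.group("code"))))
            continue
        restart = RESTART_RE.search(msg)
        if restart:
            events.append((m.group("ts"), "restart",
                           restart.group("res"), int(restart.group("count"))))
    return events

# Three lines copied from the excerpt above.
SAMPLE_LINES = [
    "2013-04-16 12:22:23.189: [  CRSAPP][1482860896]0CheckResource error for PORTAL.listener error code = 1",
    "2013-04-16 12:22:32.741: [  CRSRES][1482860896]0Restarting PORTAL.listener on orclsrv16",
    "2013-04-16 12:22:32.955: [  CRSRES][1482860896]0Successfully restarted PORTAL.listener on orclsrv16, RESTART_COUNT=37",
]

if __name__ == "__main__":
    for event in restart_events(SAMPLE_LINES):
        print(event)
```

The timestamps collected this way can then be compared by hand with entries in the listener log for the same moments.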

            It seems that it all starts with the CheckResource error on PORTAL.listener.
            PORTAL.listener is one of the HA resources returned by the crsstat command; I presume it's related to the Oracle listener. Is there any way to check that relationship?
            I mean, is there any configuration file, parameter, or other way to confirm that PORTAL.listener is the CRS name for the Oracle listener?
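One possible way to check this, offered as a sketch and not as a confirmed answer from the thread: `crs_stat -p <resource>` prints the profile a resource was registered with, and its ACTION_SCRIPT entry names the script CRS calls to start, stop, and check the resource. Reading that script should show whether it drives the Oracle listener. The sample profile below, including the script path, is hypothetical.

```python
import subprocess

def parse_profile(text):
    """Turn NAME=VALUE profile lines into a dict (empty values are kept)."""
    profile = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            profile[key] = value
    return profile

def resource_profile(name):
    """Fetch and parse the profile of one CRS resource on a cluster node."""
    out = subprocess.run(["crs_stat", "-p", name],
                         capture_output=True, text=True, check=True).stdout
    return parse_profile(out)

# Hypothetical profile output; the ACTION_SCRIPT path is a placeholder.
SAMPLE_PROFILE = """\
NAME=PORTAL.listener
TYPE=application
ACTION_SCRIPT=/path/to/listener_action.scr
REQUIRED_RESOURCES=
RESTART_ATTEMPTS=100
"""

if __name__ == "__main__":
    # On a real node: resource_profile("PORTAL.listener")
    profile = parse_profile(SAMPLE_PROFILE)
    print(profile.get("ACTION_SCRIPT"))
```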

            I checked the listener log and found an error with the same timestamp as the crsd.log error; I am now searching for more details on that error.
            I am guessing you are using active/passive for cost savings (ie not using RAC).
            Well, to be honest, we weren't looking for CRS: we needed to migrate two databases to a different Oracle server for infrastructure consolidation, and those CRS instances were already available and licensed, with a very low system load, so we took the opportunity.
            I would prefer two independent Oracle instances on an A/P cluster with Red Hat Cluster Manager: less integration with Oracle, but easier to manage imho :)

            Edited by: 1000142 on 16-apr-2013 6.12
            • 3. Re: Oracle cluster RESTART_COUNT growing
              KR10822864
              Any suggestion on some logs?
              I've already checked alert.log and the listener log but I found nothing useful, is there any CRS log?

              The Oracle instances (10.2.0.3.0 Enterprise) are installed on RedHat Linux ES 4.0 x64.
              I hope the logs below give you some clues; they are the usual places to look when troubleshooting cluster database problems in a RAC environment.

              $ORA_CRS_HOME/crs/log: Contains trace files for the CRS resources; you may find error details here.
              $ORA_CRS_HOME/crs/init: Contains trace files of the CRS daemon during startup; it may show CRS login problems and similar issues.
              $ORA_CRS_HOME/css/log: The Cluster Synchronization Services (CSS) logs record all actions such as reconfigurations, missed check-ins, connects, and disconnects from the client CSS listener. In some cases, the logger logs messages with the category auth.crit for reboots done by Oracle; this can be used to check the exact time when a reboot occurred.
              $ORA_CRS_HOME/css/init: Contains core dumps from the Oracle Cluster Synchronization Services daemon (OCSSd) and the process ID (PID) of the CSS daemon, whose death is treated as fatal. If CSS has restarted abnormally, the core files will have the format of core..etc...
              $ORA_CRS_HOME/srvm/log: Log files for the Oracle Cluster Registry, which contain details at the Oracle cluster level.
              $ORA_CRS_HOME//log: The cluster alert log, which contains diagnostic messages at the Oracle cluster level. This is available from Oracle Database 10g R2.
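To see which of these directories actually exist on a given node and which files were written most recently, something like the following could be used. It is a rough sketch: the subdirectory list is copied from the entries above (the last entry is left out because its path is incomplete), and the `limit` parameter and function name are made up for illustration.

```python
import os

# Subdirectories taken from the list above, relative to $ORA_CRS_HOME.
CANDIDATE_SUBDIRS = ["crs/log", "crs/init", "css/log", "css/init", "srvm/log"]

def recent_files(crs_home, subdirs=CANDIDATE_SUBDIRS, limit=5):
    """Map each existing subdirectory to its most recently modified files,
    newest first. Missing directories are silently skipped."""
    found = {}
    for sub in subdirs:
        path = os.path.join(crs_home, sub)
        if not os.path.isdir(path):
            continue
        entries = []
        for name in os.listdir(path):
            full = os.path.join(path, name)
            if os.path.isfile(full):
                entries.append((os.path.getmtime(full), full))
        entries.sort(reverse=True)  # newest modification time first
        found[sub] = [full for _, full in entries[:limit]]
    return found

if __name__ == "__main__":
    home = os.environ.get("ORA_CRS_HOME")
    if home:
        for sub, files in recent_files(home).items():
            print(sub)
            for f in files:
                print("   ", f)
    else:
        print("ORA_CRS_HOME is not set")
```

Sorting by modification time makes it easier to spot files touched around the time of a restart.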