1 Reply Latest reply on Apr 2, 2013 10:25 AM by moreajays

    Monitoring Problems by Jobs from DBMS_SCHEDULER

      Hi @ everybody,

      i have a script desinged, that is able to create fullbackups and incremental backups with rman. This script i have confiured with DBMS_SCHEDULER on an Oracle 11gR2 test instance on a CentOs 6.3 Server. It will run by hardlinks, this hardlinks can tell the script, if he has to create a fullbackup or an incremental backup. Everytime when a job was started i will get an email, when it is finished completely or has an error by executing.

      Now i want to monitor that jobs. For that i have created an SQL statement, that lokks into the table dba_scheduler_job_run_details and gives out the status of the last running job. In my monitoring software (Nagios 3) im checking this status, if it is "SUCCEEDED" or not. When it is not "SUCCEEDED" Nagios gives me a warning. Here is the SQL statement:
      SELECT STATUS FROM dba_scheduler_job_run_details WHERE LOG_DATE IN ( SELECT MAX(LOG_DATE) FROM dba_scheduler_job_run_details WHERE job_name LIKE '%BACKUP%');
      Now my problem: I was trying to occur some errors by running the script to test the monitoring. For example: i've moved the datafile of the undotbs tablespace. When i start the backup job manually in this scenario, i get an email with this error:
      Recovery Manager: Release - Production on Fri Mar 22 10:52:00 2013
      Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
      connected to target database: DBTEST2 (DBID=1232110510)
      executing command: SET compression
      using target database control file instead of recovery catalog
      RMAN> 2> 3> 4> 5> 6> 7> 
      allocated channel: oem_backup_disk1
      channel oem_backup_disk1: SID=21 device type=DISK
      Starting backup at 22-MAR-13
      RMAN-06169: could not read file header for datafile 3 error reason 5 released channel: oem_backup_disk1
      RMAN-00571: ===========================================================
      RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
      RMAN-00571: ===========================================================
      RMAN-03002: failure of backup command at 03/22/2013 10:52:02
      RMAN-06056: could not access datafile 3
      Recovery Manager complete.
      But now my problem is, that this job is enteret in the dba_scheduler_job_run_details not with an error! The status is "SUCCEEDED" also and "ERROR#" is 0. So i didn't get a alarm by Nagios, because in that state everything would be fine for him.

      No i'm trying to find out, what i could do. Mi ideas would be:

      - can i get out any error return code from the rman part in the script that i can use by monitoring with Nagios?


      - can i manipulate the dba_scheduler_job_run_details tablespace with a "normal" user so that the entry of the job has not STATUS='SUCCEEDED'?

      Thanks for ideas and help!

      Best regards,