1 Reply Latest reply on Mar 24, 2014 3:36 PM by EmaxG

    Replicating Million rows takes long time at replicating (OGG 11.2.1.0.11)

    divakar.x.mehta-Oracle

      Hello OGG Experts,

      Customer is using OGG 11.2.1.0.11 on 2 Node 11.2.0.3 RAC (solaris SPARC using global zones) . They hit an issue last weekend when they had to delete 2 MILLION ROWS from one of the tables being replicated in that the process took 48 hours to completely propagate to target (source deletes committed in about 3 hours)

      -------
      Issue# 1 ->
      -------

      Issue was Even after the source deletes had been committed, the extract "Lag at Chkpt" and "Time Since Chkpt" are very behind - eg if 1-2 million deletes take three hours  on source and all deletes have been performed and committed at source, then for as long as the replication of deletes goes on (which can be several times the length of the source deletes), then these metrics show a huge lag.

      -------
      Issue# 2 ->
      -------

      Also during this time the Bounder recovery (BR) checkpointing on Source increased beyond the default 4 hour window (before the delete the BR Checkpoints were happening every 4 hours as expected) .
      Before the deletes the "gg error log" shows BR happening every 4 hours. During the deletes BR seems to cease - you see no reporting of it in the gg error log. Also the "Time Since Chkpt" never gets reset every 4 hours (default BR interval).

      -------
      Issue# 3->
      -------

      Also If you insert more records into the source DB (post the deletes), the extract trail files do get created on the source however Despite this the "showch" output show the current trail as way behind - at the start of the deletes and doesn't move

      Also the new INSERTS never get propagated to TARGET until the deletes have finished (new inserts are only replicated after around 48 hours!) so they are kind of backlogged 

      I have replicated the same on my Internal OGG environment wherein the delete of a million rows took 12 mins on SOURCE Extract end and to propagate the same to pump trails and finally to relicat and apply to target took 20 hours

      During this process I observed (1 million row deletes at source extract was carried out around 12:12 PM , 17th March 2014) ->>


      1) Looking at 'Time Since Chkpt' column value from 'info all' command at SOURCE EXTRACT End during the delete propagation


      Program     Status      Group       Lag at Chkpt  Time Since Chkpt

      MANAGER     RUNNING
      EXTRACT     RUNNING     EORGRG      00:00:00      17:52:31    >>> shows checkpoint lag since the commit timestamp of 12:12 PM so 2:30 hrs behind >>>
      EXTRACT     RUNNING     PORGRG      00:00:00      17:52:31    >>> shows checkpoint lag since the commit timestamp of 12:12 PM so 2:30 hrs behind >>>
      REPLICAT    RUNNING     RORGRG      00:00:00      00:00:06



      GGSCI (esscracdgc1) 7> info er * , showch


      EXTRACT    EORGRG    Last Started 2014-03-14 18:04   Status RUNNING
      Checkpoint Lag       00:00:00 (updated 17:52:32 ago)
      Log Read Checkpoint  Oracle Redo Logs
                           2014-03-17 12:12:37  Thread 1, Seqno 299, RBA 7785984    >>> log read checkpoint from time of commit 17 hrs behind>>
                            SCN 0.8451396 (8451396)
      Log Read Checkpoint  Oracle Redo Logs
                            2014-03-11 13:49:21  Thread 2, Seqno 0, RBA 0
                            SCN 0.0 (0)

      ..
      ..



      EXTRACT EORGRG (PID 15127)
         Current status: Recovery complete: Processing data

         Current read positions:
         Redo thread #: 1
         Sequence #: 299
         RBA: 7785984
         Timestamp: 2014-03-17 12:12:37.000000
         SCN: 0.8451396
         Redo thread #: 2
         Sequence #: 0
         RBA: 0
         Timestamp: 2014-03-11 13:49:21.000000
         SCN: 0.0
         Current write position:
        Sequence #: 24                >> checkpoint still at seq# 24 even though we have trails upto 33 generated at source
         RBA: 2027082
         Timestamp: 2014-03-17 12:12:51.148077
         Extract Trail: ./dirdat/lb




      >> Once extract had done its of job of writting to trails the checkpoint moved at EXTRACT end >>


      Program     Status      Group       Lag at Chkpt  Time Since Chkpt

      MANAGER     RUNNING                                          
      EXTRACT     RUNNING     EORGRG      17:52:36      00:00:10            >>> extract checkpointed now as it had written all to trail
      EXTRACT     RUNNING     PORGRG      00:00:00      17:52:49            >>> pump was still pushing to replicat on target

      REPLICAT    RUNNING     RORGRG      00:00:00      00:00:00



      2) From " info er * , showch" I could see  'Checkpoint Lag' creeping up at replicat end


      Program     Status      Group       Lag at Chkpt  Time Since Chkpt

      MANAGER     RUNNING                                          
      EXTRACT     RUNNING     EORGRG      00:00:00      00:00:01   
      EXTRACT     RUNNING     PORGRG      00:00:00      00:00:03   
      REPLICAT    RUNNING     RORGRG      00:00:00      20:07:30     >> Lag


      3)  Last BR Checkpoint was 11:12 (before the delete went in) and no BR checkpoints since (now 5 hours) >>


      # grep -i BOUND ggserr.log |tail -10


      2014-03-17 11:12:13
        INFO    OGG-01738  Oracle GoldenGate Capture for Oracle, EORGRG.prm:  BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p6339_Redo Thread 1: start=SeqNo: 289, RBA: 572944, SCN: 0.8409873 (8409873), Timestamp: 2014-03-17 11:12:12.000000, Thread: 1, end=SeqNo: 289, RBA: 573440, SCN: 0.8409873 (8409873), Timestamp: 2014-03-17 11:12:12.000000, Thread: 1.
      2014-03-17 11:12:13  INFO    OGG-01738  Oracle GoldenGate Capture for Oracle, EORGRG.prm:  BOUNDED RECOVERY: CHECKPOINT: for object pool 2: p6339_Redo Thread 2:.

      << no entries seen in between >>

      << We CAN SEE BOUNDED RECOVERY CHECKPOINTING ALSO SUFFERED AS IT COULDN'T TAKE A CHECKPOINT FROM 2014-03-17 11:12:13 (pre delete) ..to.. 2014-03-18 06:05:30 (when extract moved checkpoint forward) AS IT WAS DOING EVERY 4 HOURS >>>>

      2014-03-18 06:05:30 INFO    OGG-01738  Oracle GoldenGate Capture for Oracle, EORGRG.prm:  BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p6339_Redo Thread 1: start=SeqNo: 299, RBA: 11355664, SCN: 0.8455516 (8455516), Timestamp: 2014-03-17 13:00:38.000000, end=SeqNo: 304, RBA: 7022592, SCN: 0.8580836 (8580836), Timestamp: 2014-03-18 06:05:27.000000, Thread: 1.


      ==> Question really is  whether OGG can cope with a Million row deletes well (is there a better way to tackle this) and whether above checkpoint lag issues seen above are expected ( I think it is expected as all OGG processes were lagging behind each other)

      Any advise would be more than useful

      Thanks,
      Divakar