1 Reply Latest reply on Feb 27, 2012 2:51 PM by "Andrei Costache, Oracle-Oracle"

    Replication crashed during recovery -- 10,000 databases (122GB data)

      Hi guys,

      we are using an older BDB version, 4.8.30, with about 10,000 databases and about 120GB of data.
      Currently we are running in master-only mode (because the replica cannot be started). We would like to run with 1 master and 1 replica.

      - LINUX, Java6 with BDB-C lib
      - BDB 4.8.30
      - 10000 databases with ~120GB data
      - oldest DB is 24h old
      - a checkpoint runs every 5 min
      - db_verify over all databases is successful (no corrupt databases)

      1. Start Replica
      2. Start Master
      --> Replication starts (timestamp1) and needs ~3 hours (120GB + log files)
      --> Copying of the databases (oldest DB is 1 day old) + log files to the replica starts
      --> Meanwhile, new data keeps arriving and is written to the databases
      --> Copying of the databases + log files to the replica finishes (~3 hours after timestamp1)
      --> The replica crashes with the following log output:

      2012-02-17 21:20:35,213[Thread-839][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;31;Log sequence error: page LSN 0 0; previous LSN 110226 81585213
      2012-02-17 21:20:35,214[Thread-840][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;32;Recovery function for LSN 110704 84346136 failed on forward pass
      2012-02-17 21:20:35,289[Thread-841][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;33;Client initialization failed. Need to manually restore client
      2012-02-17 21:20:35,289[Thread-842][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;34;PANIC: Invalid argument
      2012-02-17 21:20:35,308[Thread-844][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;35;PANIC: fatal region error detected; run recovery
      2012-02-17 21:20:35,308[Thread-845][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;36;PANIC: fatal region error detected; run recovery
      2012-02-17 21:20:35,320[Thread-847][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;37;RunRecoveryException:DB_RUNRECOVERY: Fatal error, run database recovery
      2012-02-17 21:20:35,320[Thread-843][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;39;IllegalArgumentException:Invalid argument
      2012-02-17 21:20:35,321[Thread-849][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;40;DB_ENV->rep_process_message: DB_RUNRECOVERY: Fatal error, run database recovery
      2012-02-17 21:20:35,322[Thread-850][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;41;message thread failed: DB_RUNRECOVERY: Fatal error, run database recovery
      2012-02-17 21:20:35,322[Thread-851][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;42;PANIC: DB_RUNRECOVERY: Fatal error, run database recovery
      2012-02-17 21:20:35,323[Thread-852][ERROR, com.mywork.test.remote.server.impl.CommonJMXErrorListener] DB;2012-02-17T20:20:35Z;43;RunRecoveryException:DB_RUNRECOVERY: Fatal error, run database recovery
      2012-02-17 21:20:35,328[Thread-852][ERROR, com.mywork.test.dataaccess.bdbc.impl.ReplicationEnvironment] DATABASE ENVIRONMENT PANICS, HALTING SYSTEM

      My questions:
      1. The oldest DB is 24h old, yet the missing log file dates from 3 days in the past. Why?
      2. Does running db_checkpoint -1 make sense, so that only the latest log file is needed? Can this be checked with db_archive?
      3. Is running db_load -r lsn a feasible solution? Is it possible to run it on a running system (open environment)?
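
      To make question 1 easier to follow: an LSN in the messages above is a pair of log file number and offset, and Berkeley DB names its log files log.NNNNNNNNNN (ten digits). The sketch below pulls the LSNs out of an error message and maps them to log file names. The class and method names are hypothetical, and the code only parses message text; it does not touch the environment or use the BDB API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Berkeley DB: maps LSNs found in
// error messages to the log file names they refer to.
public class LsnLogFiles {
    // Matches "LSN <file> <offset>" as it appears in the error messages.
    private static final Pattern LSN = Pattern.compile("LSN (\\d+) (\\d+)");

    // Returns the log file numbers of all LSNs in one message.
    // File number 0 is skipped: log files are numbered from 1,
    // so "LSN 0 0" does not name a log file.
    public static List<Long> logFileNumbers(String message) {
        List<Long> files = new ArrayList<Long>();
        Matcher m = LSN.matcher(message);
        while (m.find()) {
            long file = Long.parseLong(m.group(1));
            if (file > 0) {
                files.add(file);
            }
        }
        return files;
    }

    // Formats a log file number the way BDB names log files on disk.
    public static String logFileName(long fileNumber) {
        return String.format("log.%010d", fileNumber);
    }

    public static void main(String[] args) {
        String[] messages = {
            "Log sequence error: page LSN 0 0; previous LSN 110226 81585213",
            "Recovery function for LSN 110704 84346136 failed on forward pass"
        };
        for (String message : messages) {
            for (long file : logFileNumbers(message)) {
                System.out.println(logFileName(file));
            }
        }
    }
}
```

      For the two messages above this prints log.0000110226 and log.0000110704, which are the files to look for in the log directory of the master and the replica.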

      Thx -- Stefan

      Edited by: Stefan W. on Feb 22, 2012 9:54 AM

      Edited by: Stefan W. on Feb 23, 2012 9:49 AM