1 Reply Latest reply: Dec 21, 2010 10:52 PM by 825850

    Memory issue on replica client

    660988
      I am using Berkeley DB 4.7.25 on FreeBSD 7.0 with the C++ API.

      I have applied the patch from the thread "Re: Question on replication error like "DB_ENV->rep_process_message: DB_NOTF.."" to fix the log_archive issue. I have also applied the patch suggested in the reply to that message.

      On the master node, I am doing a lot of write operations with periodic checkpointing.

      Case 1:
      =======

      Later, when the master node archives (deletes) log files after checkpointing, and after a few minutes of transactions, I get the following error on the client node.

      -----
      Log sequence error: page LSN 0 0; previous LSN 25 1048356
      Recovery function for LSN 26 4263441 failed on forward pass
      Client initialization failed. Need to manually restore client
      PANIC: Invalid argument
      DB_ENV->rep_process_message: DB_RUNRECOVERY: Fatal error, run database recovery
      message thread failed: DB_RUNRECOVERY: Fatal error, run database recovery
      PANIC: fatal region error detected; run recovery
      DB_ENV->rep_process_message: DB_RUNRECOVERY: Fatal error, run database recovery
      message thread failed: DB_RUNRECOVERY: Fatal error, run database recovery
      PANIC: DB_RUNRECOVERY: Fatal error, run database recovery
      PANIC: DB_RUNRECOVERY: Fatal error, run database recovery
      -----
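
      For reference, this is how I read the first line of that log. The sketch below is my own illustration, not Berkeley DB code: `Lsn`, `compare_lsn`, and `page_is_behind` are hypothetical names, and the rule that a page must not be older than the record's previous LSN during the forward pass is my assumption about what the check does.

```cpp
#include <cstdint>

// Hypothetical stand-in for a log sequence number as printed in the error:
// a log file number followed by a byte offset within that file.
struct Lsn {
    std::uint32_t file;
    std::uint32_t offset;
};

// Three-way comparison: negative if a < b, zero if equal, positive if a > b.
// File number is compared first, then the offset within the file.
int compare_lsn(const Lsn& a, const Lsn& b) {
    if (a.file != b.file) return a.file < b.file ? -1 : 1;
    if (a.offset != b.offset) return a.offset < b.offset ? -1 : 1;
    return 0;
}

// Assumed rule (not confirmed against the library source): when a log record
// is replayed forward, the page is expected to be at least as new as the
// "previous LSN" stored in that record. A page that is older is "behind".
bool page_is_behind(const Lsn& page, const Lsn& previous) {
    return compare_lsn(page, previous) < 0;
}
```

      If that reading is right, `page_is_behind(Lsn{0, 0}, Lsn{25, 1048356})` is true for the values in my log, which would mean the client's page never received updates that the replayed record expects to be there.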

      Please advise: what could be wrong, and how can I fix it?


      Case 2:
      =====

      In similar runs, when I do not call log_archive on the master node to delete log files, the memory footprint of the client process periodically grows a lot and then drops back to normal. I suspect this happens around checkpointing, when the master sends a burst of messages to the client to replicate. Over time, however, the footprint grows so large that the process starts using swap space and there is not enough memory left to allocate. Is this fluctuation of the memory footprint on the client node expected behaviour?

      The following output of db_stat-4.7 -MA might help.

      These are the statistics from the replica client machine.

      -----
      Mpool REGINFO information:
      Mpool Region type
      3 Region ID
      __db.003 Region name
      0x28710000 Original region address
      0x28710000 Region address
      0x287100c0 Region primary address
      0 Region maximum allocation
      0 Region allocated
      Region allocations: 4094 allocations, 12894388 failures, 4007 frees, 1 longest
      Allocations by power-of-two sizes:
      1KB 34
      2KB 1
      4KB 0
      8KB 12898447
      16KB 0
      32KB 0
      64KB 0
      128KB 0
      256KB 0
      512KB 0
      1024KB 0
      REGION_JOIN_OK Region flags
      =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
      MPOOL structure:
      9 MPOOL region mutex 2 / 26M 0% 11489 / 674238720
      401 / 2533580 Maximum checkpoint LSN
      37 Hash table entries
      11 Hash table last-checked
      496749207 Hash table LRU count
      497385622 Put counter
      =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
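
      To put the "Region allocations" line in perspective, here is a small sketch of the arithmetic. The struct and function names are mine and are not part of Berkeley DB, and I am assuming that "failures" counts failed attempts to allocate from the region, which I have not confirmed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the three counters on the "Region allocations" line.
struct RegionAllocStats {
    std::uint64_t allocations;  // successful allocations
    std::uint64_t failures;     // assumed: failed allocation attempts
    std::uint64_t frees;        // allocations that were later released
};

// Allocations that have not been freed yet.
std::uint64_t outstanding_allocations(const RegionAllocStats& s) {
    assert(s.frees <= s.allocations);
    return s.allocations - s.frees;
}

// Average number of failed attempts per successful allocation.
double failures_per_allocation(const RegionAllocStats& s) {
    assert(s.allocations > 0);
    return static_cast<double>(s.failures) / static_cast<double>(s.allocations);
}
```

      With the values above (4094 allocations, 12894388 failures, 4007 frees), this gives 87 outstanding allocations and more than 3,000 failures per successful allocation, which is the part that looks suspicious to me.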


      Please help me resolve both cases.

      Regards,

      Sury