This discussion is archived
2 Replies Latest reply: Aug 31, 2011 6:13 PM by 885300 RSS

Replication stuck in bad state, then out-of-memory killed.

885300 Newbie
Currently Being Moderated
Hi,

We're running a BDB replication client (using the C++ interface) in a simple one DB server, one rep client set up.

Normally the replication is working fine, but a couple weeks ago the rep client started increasing its memory usage to over 3gb in the space of a few minutes, then getting out-of-memory killed by the kernel, then when we restarted the client it would do the same thing.

After enabling the DB_CONFIG option set_verbose DB_VERB_REPLICATION we found it was filling the log with this output:

2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2143][4874/4130962288] CLIENT: Returning ISPERM [12864][3364833], cmp = 0
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2151][4874/4122569584] CLIENT: rep_apply: Set apply_th 1
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2302][4874/4122569584] CLIENT: rep_apply: Decrement apply_th 0 [12864][3365033]
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2305][4874/4130962288] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12864][3365033] resend
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2393][4874/4122569584] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12864][3365105] resend
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2476][4874/4130962288] CLIENT: rep_apply: Set apply_th 1
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2604][4874/4130962288] CLIENT: rep_apply: Decrement apply_th 0 [12864][3365105]
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2706][4874/4122569584] CLIENT: rep_apply: Set apply_th 1
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2705][4874/4130962288] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12864][3365177] resend
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2808][4874/4122569584] CLIENT: rep_apply: Decrement apply_th 0 [12864][3365177]
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2912][4874/4130962288] CLIENT: rep_apply: Set apply_th 1
2011/08/18 02:23:16 3687170722 [D ReplicationClient 4874] BDB: [1313652196:2922][4874/4122569584] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12864][3365233] resend
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3016][4874/4130962288] CLIENT: rep_apply: Decrement apply_th 0 [12864][3365233]
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3128][4874/4122569584] CLIENT: rep_apply: Set apply_th 1
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3214][4874/4122569584] CLIENT: rep_apply: Decrement apply_th 0 [12864][3365305]
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3331][4874/4122569584] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12864][3365305] resend
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3335][4874/4130962288] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12864][3366009] resend
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3436][4874/4122569584] CLIENT: rep_apply: Set apply_th 1
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3561][4874/4122569584] CLIENT: rep_apply: Decrement apply_th 0 [12864][3366009]
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3651][4874/4130962288] CLIENT: rep_apply: Set apply_th 1
2011/08/18 02:23:16 3687170723 [D ReplicationClient 4874] BDB: [1313652196:3662][4874/4122569584] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12864][3366153] resend
2011/08/18 02:23:16 3687170754 [D ReplicationClient 4874] BDB: [1313652196:34925][4874/4130962288] CLIENT: rep_apply: Decrement apply_th 0 [12864][3366153]

Unsure what else to try, we removed the replicated database. When we restarted the rep client it recreated the database from the master again and continued working happily.

Does anyone know what the log output we were seeing means, and what could have caused the replicated database to get into the state that caused it? We would like to ensure it does not happen again.

Thanks,
Thomas
  • 1. Re: Replication stuck in bad state, then out-of-memory killed.
    524722 Explorer
    Currently Being Moderated
    All of those messages seem normal. It shows the client receiving log records from
    the master and applying those log records on the client. There is nothing in those
    messages that shows anything out of the ordinary unfortunately.

    Even if the client is storing log records that arrive out of order, it is using the
    BDB mpool cache for that and therefore cannot use pages beyond the
    configured size of your cache.

    What version are you running? Is there anything different about what your
    app is doing when this starts happening (i.e. a sudden burst of load on the
    master, a network outage, a crash of a site in the group, etc)?

    Sue LoVerso
    Oracle
  • 2. Re: Replication stuck in bad state, then out-of-memory killed.
    885300 Newbie
    Currently Being Moderated
    We're running BDB 5.2.28. This has only happened on one occasion (during which the client process died/restarted maybe 5~6 times), after removing/recreating the rep clients database it hasn't happened again. There didn't seem to be anything unusual on the master, and nothing was done on the master between the client breaking and the client being fixed. The only behavioural differences we observed on the client were that it's memory usage kept increasing until it was oom-killed, and the log output.

    I'm not sure if any of this is relevant, but I've been comparing the log output while it was broken to the log output when it was fixed (the output in my original message was from when it was broken) -

    When the client was working normally the messages ending in 'resend' only appeared while the client was restoring the database files, and during this time there were none of the 'Returning ISPERM' messages - once the client finished syncing the files it started generating the 'Returning ISPERM' messages, and there was no longer a 'resend' on the end of the 'rep_process_message' lines... e.g. here is some of the output during normal operation while it was restoring the db files:

    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191286][5396/4121959280] CLIENT: FILEDONE: have 84 pages. Need 991.
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191378][5396/4130351984] CLIENT: PAGE: Received page 84 from file 1
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191375][5396/4121959280] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type page, LSN [12866][2935520] resend
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191482][5396/4130351984] CLIENT: PAGE: Write page 84 into mpool
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191580][5396/4130351984] CLIENT: PAGE_GAP: pgno 84, max_pg 990 ready 84, waiting 0 max_wait 0
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191663][5396/4130351984] CLIENT: FILEDONE: have 85 pages. Need 991.
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191744][5396/4130351984] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type page, LSN [12866][2935520] resend
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191748][5396/4121959280] CLIENT: PAGE: Received page 85 from file 1
    2011/08/18 02:26:31 3687365911 [D ReplicationClient 5396] BDB: [1313652391:191901][5396/4121959280] CLIENT: PAGE: Write page 85 into mpool
    2011/08/18 02:26:31 3687365912 [D ReplicationClient 5396] BDB: [1313652391:192001][5396/4121959280] CLIENT: PAGE_GAP: pgno 85, max_pg 990 ready 85, waiting 0 max_wait 0

    and here is some of the output during normal operation after the db has finished the initial sync:

    2011/08/18 02:26:50 3687384879 [D ReplicationClient 5396] BDB: [1313652410:159509][5396/4121959280] CLIENT: Returning ISPERM [12866][2995928], cmp = 0
    2011/08/18 02:27:12 3687407340 [D ReplicationClient 5396] BDB: [1313652432:620174][5396/4130351984] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12866][2996472]
    2011/08/18 02:27:12 3687407340 [D ReplicationClient 5396] BDB: [1313652432:620336][5396/4130351984] CLIENT: rep_apply: Set apply_th 1
    2011/08/18 02:27:12 3687407340 [D ReplicationClient 5396] BDB: [1313652432:620428][5396/4130351984] CLIENT: rep_apply: Decrement apply_th 0 [12866][2996548]
    2011/08/18 02:27:12 3687407340 [D ReplicationClient 5396] BDB: [1313652432:620910][5396/4121959280] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12866][2996548]
    2011/08/18 02:27:12 3687407341 [D ReplicationClient 5396] BDB: [1313652432:621035][5396/4121959280] CLIENT: rep_apply: Set apply_th 1
    2011/08/18 02:27:12 3687407341 [D ReplicationClient 5396] BDB: [1313652432:621065][5396/4130351984] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12866][2996620]
    2011/08/18 02:27:12 3687407341 [D ReplicationClient 5396] BDB: [1313652432:621118][5396/4121959280] CLIENT: rep_apply: Decrement apply_th 0 [12866][2996620]
    2011/08/18 02:27:12 3687407341 [D ReplicationClient 5396] BDB: [1313652432:621234][5396/4130351984] CLIENT: rep_apply: Set apply_th 1
    2011/08/18 02:27:12 3687407341 [D ReplicationClient 5396] BDB: [1313652432:621232][5396/4121959280] CLIENT: databases rep_process_message: msgv = 6 logv 18 gen = 61 eid 0, type log, LSN [12866][2996688]

    As I said I'm not sure if any of this is even relevant, but it makes me wonder if the master was constantly resending information that it shouldn't have been or something along those lines?

    Thanks for the help,
    Thomas

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points