We're running a BDB replication client (using the C++ interface) in a simple setup with one DB server and one rep client.
Replication normally works fine, but a couple of weeks ago the rep client's memory usage started climbing to over 3 GB within a few minutes, after which it was out-of-memory killed by the kernel. When we restarted the client, it did the same thing again.
After enabling the DB_CONFIG option set_verbose DB_VERB_REPLICATION, we found it was filling the log with this output:
All of those messages seem normal. They show the client receiving log records from
the master and applying those log records on the client. Unfortunately, nothing in
those messages shows anything out of the ordinary.
Even if the client is storing log records that arrive out of order, it is using the
BDB mpool cache for that and therefore cannot use pages beyond the
configured size of your cache.
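
To make that bound concrete, here is a minimal DB_CONFIG sketch. The 256 MB cache size is purely an illustrative assumption, not a value taken from this thread. The set_verbose line is the one already mentioned in the original post. If out-of-order log records really are held in the mpool cache, set_cachesize is the setting that would cap them. It would not limit any memory the process allocates outside the cache.

```
# Illustrative value only: 0 GB + 268435456 bytes (256 MB), in 1 cache region.
set_cachesize 0 268435456 1
# Replication diagnostics, as enabled in the original post.
set_verbose DB_VERB_REPLICATION
```

Comparing the configured cache size with the process's actual resident memory would show whether the growth is happening inside or outside the cache.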
What version are you running? Is there anything different about what your
app is doing when this starts happening (e.g. a sudden burst of load on the
master, a network outage, a crash of a site in the group, etc.)?
We're running BDB 5.2.28. This has only happened on one occasion (during which the client process died and restarted maybe 5 or 6 times); after removing and recreating the rep client's database, it hasn't happened again. There didn't seem to be anything unusual on the master, and nothing was done on the master between the client breaking and the client being fixed. The only behavioural differences we observed on the client were that its memory usage kept increasing until it was OOM-killed, and the log output.
I'm not sure if any of this is relevant, but I've been comparing the log output from while it was broken with the log output from after it was fixed (the output in my original message was from when it was broken).
When the client was working normally, the messages ending in 'resend' only appeared while the client was restoring the database files, and during this time there were none of the 'Returning ISPERM' messages. Once the client finished syncing the files, it started generating the 'Returning ISPERM' messages, and there was no longer a 'resend' at the end of the 'rep_process_message' lines. For example, here is some of the output during normal operation while it was restoring the db files: