As I dig deeper, the problem appears to be some sort of interaction between my application threads and the repmgr election thread. I have not yet been able to produce a minimal test case to verify this, but here is what I have seen:
1. I start up a threaded, transactional database A, starting repmgr with DB_REP_ELECTION
2. I start up a threaded, transactional database B, also with DB_REP_ELECTION
3. A is MASTER, B is CLIENT
4. I then kill A, forcing B to become MASTER
5. I then issue a number of queries against B.
Following these steps, I can eventually get database B to fail in a call to __db_check_txn; the error string is "Transaction that opened the DB is still active".
Although my application is multi-threaded, I've reduced it to run with only a single thread. Furthermore, the value of dbp->cur_locker->tid is the id of the same thread that issues the DB->get which causes the error.
My debug logging shows that the election thread exits before this error appears. However, the error does not seem to appear if no election ever occurs (i.e. if I start up only a single database), and it also does not appear when the MASTER database was not first a CLIENT.
I am building against libdb.a version 5.1.25.
The database handles are all opened with DB_AUTO_COMMIT, and the calls to DB->get are passing NULL in the txn parameter.
I have no immediate ideas about the issue. Could you please turn on replication verbose messages in your application and reproduce the error with the simplest setup (i.e. 1 thread)? See the dbenv->set_verbose man page and use the DB_VERB_REPLICATION flag.
That will generate a lot of output. You can then contact me by email using the typical form of email@example.com as I've spelled it below. Thanks.