I would like to validate my understanding of the log archiving that the BDB API exposes, and based on that try to explain a phenomenon we are experiencing.
The way we archive logs is using the following:
env->log_archive(env, &list, DB_ARCH_REMOVE)
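To make the usage concrete, here is a minimal sketch of how such a call is typically wrapped (this is illustrative, not our actual code; with DB_ARCH_REMOVE the API returns no list, so `list` is expected to stay NULL):

```c
/* Sketch: remove log files BDB considers no longer needed.
 * Requires linking against Berkeley DB (libdb). */
#include <stdio.h>
#include <db.h>

int archive_obsolete_logs(DB_ENV *env)
{
    char **list = NULL;
    /* DB_ARCH_REMOVE: delete obsolete log files; no filenames returned. */
    int ret = env->log_archive(env, &list, DB_ARCH_REMOVE);
    if (ret != 0)
        fprintf(stderr, "log_archive failed: %s\n", db_strerror(ret));
    return ret;
}
```

In our setup this routine would run on a timer (every 30 seconds, see below).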
According to the Berkeley DB documentation, the DB_ARCH_REMOVE flag tells log_archive to figure out which log files are obsolete and remove them.
Pasted from the BDB doc:
"DB_ARCH_REMOVE: Remove log files that are no longer needed; no filenames are returned."
First question: what are the exact criteria for removing them? Is it based on transactions being opened/closed, etc.?
We are using a replication cluster and we remove unused logs every 30 seconds; is that a reasonable interval?
Following is an error I am seeing from time to time in Berkeley DB after reconnecting a cluster (two boxes only) whose network was cut off:
Sep 11 17:40:11 pxehost-32-120 svsde_out: Checkpoint LSN record  not found
Sep 11 17:40:11 pxehost-32-120 svsde_out: DB_ENV->rep_process_message: DB_NOTFOUND: No matching key/data pair found
Sep 11 17:40:11 pxehost-32-120 svsde_out: message thread failed: DB_NOTFOUND: No matching key/data pair found
Sep 11 17:40:11 pxehost-32-120 svsde_out: PANIC: DB_NOTFOUND: No matching key/data pair found
My understanding (asking for validation here) is that once an election is held, there is a validation step in which the recovering node goes over the log to reach a valid checkpoint. It looks to me that while doing so it is looking for a certain LSN, does not find it, and then raises the panic error.
I have a few questions that I would like to understand:
1. When do you think such a recovery is triggered? What condition triggers it? Is that normal?
2. Once it is triggered, how does BDB decide which LSN to look for? What are the exact criteria?
3. If the problem is that log archiving removed something it shouldn't have, then it looks like a bug, since we used the flag mentioned above.
4. Should we be using log archiving differently?
Thx a lot.
I need more information to gain an idea about what might be happening.
What version of BDB are you using?
Can you send me the contents of __db.rep.diag00 and __db.rep.diag01 from
the failing environment by email? Use firstname.lastname@example.org with
my name shown below.
Did you have verbose messaging turned on at the time? If not,
can you reproduce with full verbose messaging?
How long were the two sites disconnected from each other?
After reconnecting, I would expect both sites to detect a DUPMASTER situation,
both downgrade to client, and then hold an election. Do you know whether the
failing site was the newly elected master or a client replica?
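For reference, when using the base replication API (rather than repmgr, which handles this automatically), the downgrade described above is driven from the application's event callback. This is a hedged sketch, not the poster's code, and the names outside the BDB API are illustrative:

```c
/* Sketch: base-API handling of a duplicate-master event after a
 * network partition heals. Requires Berkeley DB (libdb). */
#include <db.h>

static void rep_event_cb(DB_ENV *env, u_int32_t event, void *event_info)
{
    (void)event_info;
    if (event == DB_EVENT_REP_DUPMASTER) {
        /* Both sites think they are master: downgrade this one to
         * client ... */
        (void)env->rep_start(env, NULL, DB_REP_CLIENT);
        /* ... and then the application would arrange for an election,
         * e.g. by signaling a thread that calls env->rep_elect(). */
    }
}

/* Installed once at environment setup time:
 *     env->set_event_notify(env, rep_event_cb);
 */
```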
What log files exist on the other site at the time of the failure?
We'll start with these questions and see where it leads us.
Thx for your reply.
I will check whether I can provide that information. I am not sure we have the verbose messages, since we usually turn them off due to the performance impact.
Some more background on the tests below: we have a script that fills Berkeley DB tables while, every 3 minutes, we alternately split and rejoin the cluster (split/un-split brain).
Again, log archiving runs every 30 seconds on each active machine, and the write rate is about 1000 rows per second.
The BDB version we are using is 5.1.19.
I will check whether I have the files you requested.
In the meantime, is it possible to get some answers to the theoretical questions I raised above? I think a better understanding will help me diagnose whether or not I am misusing the framework.
Thx a lot.
Have a good day.
We discussed this issue today and have found a situation that might cause this. We are discussing the bug and possible fixes internally. We'll post more when we have more information.
It turns out that the scenario we considered is handled properly in 5.1.19. While we're looking for other situations that fit your description, it would still be helpful if you could send the __db.rep.diag00/01 files we previously asked for.
I am not seeing any holes in the log archiving coordination. Do you have a stack trace
from this panic? I think the only way to make progress is to get the __db.rep.diag files, which
may provide a clue about a path we're overlooking, as they may indicate what generally was going
on at the site at the time. A stack trace, as well as the
contents of *rep from gdb, would be additional clues.