I need more information to get an idea of what might be happening.
What version of BDB are you using?
Can you email me the contents of __db.rep.diag00 and __db.rep.diag01 from
the failing environment? Use firstname.lastname@example.org with
my name shown below.
Did you have verbose messaging turned on at the time? If not,
can you reproduce with full verbose messaging?
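For reference, verbose replication diagnostics can be enabled on the environment handle. A minimal sketch, assuming an already-created DB_ENV and an illustrative output path:

```c
#include <errno.h>
#include <stdio.h>
#include <db.h>

/* A minimal sketch: enable full replication verbosity on an existing
 * environment handle.  The output file path is illustrative. */
int enable_rep_verbose(DB_ENV *dbenv)
{
	FILE *fp;
	int ret;

	if ((fp = fopen("rep_verbose.log", "w")) == NULL)
		return (errno);
	/* Route diagnostic messages to the file instead of stderr. */
	dbenv->set_msgfile(dbenv, fp);
	/* DB_VERB_REPLICATION turns on all replication diagnostics. */
	if ((ret = dbenv->set_verbose(dbenv, DB_VERB_REPLICATION, 1)) != 0)
		dbenv->err(dbenv, ret, "set_verbose");
	return (ret);
}
```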
How long were the two sites disconnected from each other?
After reconnecting, I would expect both sites to detect a DUPMASTER situation,
downgrade to client, and then hold an election. Do you know whether the
failing site was the newly elected master or a client replica?
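For context on that expected sequence (not necessarily how your application is wired): with the base replication API, the application sees this as a DB_EVENT_REP_DUPMASTER event and must do the downgrade and election itself, whereas repmgr handles it automatically. A rough sketch patterned on the style of the ex_rep example; the app_data struct and flag names are made up:

```c
#include <db.h>

/* Sketch of base-API event handling.  The callback only records what
 * happened; a separate application thread reacts by calling
 * rep_start()/rep_elect(), since the event callback should stay light. */
struct app_data {
	int got_dupmaster;	/* both sites claimed mastership */
	int elected;		/* this site won the election */
};

static void event_callback(DB_ENV *dbenv, u_int32_t event, void *info)
{
	struct app_data *app = dbenv->app_private;

	(void)info;		/* unused here */
	switch (event) {
	case DB_EVENT_REP_DUPMASTER:
		app->got_dupmaster = 1;
		break;
	case DB_EVENT_REP_ELECTED:
		app->elected = 1;
		break;
	default:
		break;
	}
}

/* In the application's main loop (not the callback):
 *	if (app->got_dupmaster) {
 *		dbenv->rep_start(dbenv, NULL, DB_REP_CLIENT);
 *		dbenv->rep_elect(dbenv, nsites, nvotes, 0);
 *	}
 *	if (app->elected)
 *		dbenv->rep_start(dbenv, NULL, DB_REP_MASTER);
 */
```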
What log files exist on the other site at the time of the failure?
We'll start with these questions and see where it leads us.
Thx for your reply.
I will check if I can provide that information. I am not sure we have the verbose messages since we usually turn them off due to the perf impact.
Some more background on the tests below: we have a script that fills Berkeley DB tables, and at the same time, every 3 minutes, we alternately split and rejoin the two sites (split/unsplit brain).
Log archiving is done every 30 seconds on each active machine, and the write rate is about 1000 rows per second.
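For concreteness, here is roughly what such a periodic archive step looks like with the C API; a sketch only, assuming the archiver shares the replication environment handle:

```c
#include <stdio.h>
#include <stdlib.h>
#include <db.h>

/* A sketch of a 30-second archive step, assuming the archiver uses the
 * same DB_ENV as the replication code.  Whether a log file is still
 * needed by the other site is exactly the coordination question here. */
int archive_step(DB_ENV *dbenv)
{
	char **list, **p;
	int ret;

	/* List log files no longer in use by this environment. */
	if ((ret = dbenv->log_archive(dbenv, &list, DB_ARCH_ABS)) != 0) {
		dbenv->err(dbenv, ret, "log_archive");
		return (ret);
	}
	if (list != NULL) {
		for (p = list; *p != NULL; ++p)
			(void)remove(*p);	/* delete each archivable log */
		free(list);
	}
	return (0);
}
```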
The BDB version we are using is 5.1.19.
I will check if I have the files you requested.
In the meantime, is it possible to get some answers to the theoretical questions I raised above? I think a better understanding would help me diagnose whether I am misusing the framework.
Thx a lot.
Have a good day.
We discussed this issue today and have found a situation that might cause this. We are discussing the bug and possible fixes internally. We'll post more when we have more information.
Is there any more information available on this thread?
Thx in advance.
It turns out that the scenario we considered is handled properly in 5.1.19. While we're looking for other situations that fit your description, it would still be helpful if you could send the __db.rep.diag00/01 files we previously asked for.
I am not seeing any holes in the log archiving coordination. Do you have a stack trace
from this panic? I think the only way to make progress is to get the __db.rep.diag files, which
may provide a clue about a path we're overlooking, since they can show what was generally going
on at the site at the time. A stack trace, as well as the contents of *rep from gdb, would be
additional clues.