It is contradicting the following, pasted from the BDB documentation, right?
"For a client to win an election, the replication group must currently have no master, and the client must have the most recent log records. In the case of clients having equivalent log records, the priority of the database environments participating in the election will determine the winner."
Your original question was whether time is a part of our LSN calculation and, by implication, whether the time of a transaction affects the choice of master after a network partition.
"We have to start with master generation because there are rare cases where relying solely on highest LSN is incorrect." Do you have an example of such a case?
The rare case I was originally thinking of concerns internal details of which log records should contribute to deciding the winner of an election. But there is also a much more common case.
"Note that this presents a special problem for a replication group consisting of only two environments. If a master site fails, the remaining client can never comprise a majority of sites in the group. If the client application can reach a remote network site, or some other external tie-breaker, it may be able to determine whether it is safe to declare itself master. Otherwise it must choose between providing availability of a writable master (at the risk of duplicate masters), or strict protection against duplicate masters (but no master when a failure occurs). Replication Manager offers this choice via the DB_ENV->rep_set_config() method."
This is precisely the trade-off you need to make. The only other option, if it is possible for you, is to add one or more sites to your replication group. You would then be able to elect a new master and durably replicate txns even with the loss of one site.
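To make the trade-off concrete, here is a minimal sketch of the majority rule described in the documentation excerpt. It is a simplified model, not Replication Manager code: the function names and the `two_site_strict` parameter are hypothetical stand-ins for the choice that the real library exposes through DB_ENV->rep_set_config().

```python
def has_majority(reachable_sites, group_size):
    # A strict majority needs more than half of the group.
    return reachable_sites > group_size // 2


def may_become_master(reachable_sites, group_size, two_site_strict):
    # reachable_sites counts this site itself plus every site it can contact.
    # Hypothetical model: with strict protection off, a lone survivor in a
    # two-site group may take over, at the risk of duplicate masters.
    if group_size == 2 and not two_site_strict:
        return True
    return has_majority(reachable_sites, group_size)


# Two-site group, master lost: the survivor is 1 of 2, never a majority.
strict_result = may_become_master(1, 2, two_site_strict=True)
relaxed_result = may_become_master(1, 2, two_site_strict=False)

# Three-site group, one site lost: the remaining 2 of 3 form a majority.
three_site_result = may_become_master(2, 3, two_site_strict=True)
```

In this model the two-site group has to pick between availability (`relaxed_result`) and protection against duplicate masters (`strict_result`), while the three-site group keeps both after losing a single site.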
I don't fully understand why you say that highest LSN has nothing to do with "recent" in terms of time.
LSNs get incremented when there is further log activity, regardless of how much actual time has passed since the last log update. So although LSNs increase as time goes by, you can't assume a particular amount of time has passed. And we do not have a way to relate a particular txn LSN to a particular time.
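A small sketch may help show why an LSN is not a timestamp. It assumes an LSN modeled as a (log file number, byte offset) pair, which mirrors the file and offset fields of DB_LSN. The `Log` class itself is hypothetical and only illustrates that the position moves with log activity, never with the clock.

```python
class Log:
    """Toy log: the LSN is a position in the log, not a point in time."""

    def __init__(self):
        self.file = 1
        self.offset = 0

    def next_lsn(self):
        # Position at which the next record would be written.
        return (self.file, self.offset)

    def append(self, record):
        # The record gets the current position; the offset then moves
        # forward by the record size. No clock is consulted anywhere.
        lsn = self.next_lsn()
        self.offset += len(record)
        return lsn


log = Log()
first = log.append(b"txn-1 commit")

# However long the site sits idle here, the next LSN does not change.
before_idle = log.next_lsn()
after_idle = log.next_lsn()

second = log.append(b"txn-2 commit")
```

Comparing `first` and `second` tells you which record was logged later on this site, but nothing about how much time separates them.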
Assume I have a cluster with only 2 machines (2SITE_STRICT off). When the cluster is connected and stable, I expect the most recent LSN to be the same on both machines, since the client inherits it from the master. Now if a split brain happens (say, by disconnecting the cable of the client machine) and only one machine of the cluster stays connected to the external network, I would expect the following to always happen:
1. There will be 2 masters after the split brain.
2. Only the LSN of one of them (the original master in this scenario) will grow, since it is the only one still receiving activity from the outside world, which triggers writes to DB tables.
3. Hence, because of 2., when you reconnect the cluster the original master will consistently be selected.
Is the above correct, assuming I am using 5.1.19? Thx.
Yes, in 5.1.19 this is the expected behavior.
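The scenario above can be sketched as a toy election that applies the rule quoted from the documentation: most recent log records first, priority only as a tie-breaker. This is a simplified model under stated assumptions, not the library's election code. The `Site` class and `pick_winner` function are hypothetical, master generation is ignored, and the isolated site is assumed to write no log records at all while it is cut off.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    lsn: tuple      # (log file number, byte offset), compared in that order
    priority: int

    def write(self, nbytes):
        # Log activity moves the LSN forward within the current log file.
        self.lsn = (self.lsn[0], self.lsn[1] + nbytes)


def pick_winner(sites):
    # Highest LSN wins; priority only matters when LSNs are equal.
    return max(sites, key=lambda s: (s.lsn, s.priority))


# Before the split both sites hold the same log records.
original = Site("original-master", (7, 1000), priority=100)
isolated = Site("isolated-client", (7, 1000), priority=200)

# During the split only the original master receives external writes.
for _ in range(5):
    original.write(128)

winner = pick_winner([original, isolated])
```

In this model the original master wins after reconnection even though the isolated site has the higher priority, because priority is consulted only when the LSNs are equal.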
As for the case with the 3 sites that you mention: even if I agree with the statement "It would be incorrect to acknowledge B's txns as permanent and then roll them back.", I am not sure this is the right thing from a business perspective, for the same reason I mentioned at the beginning of this thread: you could have 2 permanent transactions on B/C and 1 million much more recent transactions on A, and you are going to roll them back. Is there a way to avoid that, at least for my case with a 2-site cluster (without adding another site, since the number of boxes is a concern for our customers)?
The only options available are the ones I have mentioned. When you have a 2-site replication group, you get durability of committed txns if you use 2SITE_STRICT. If you don't use 2SITE_STRICT, you increase master availability at the risk of duplicate masters and some txn rollback.