8 Replies. Latest reply: Oct 4, 2011 1:03 AM by 857786

Question on LSN computing and replication

857786 Newbie
Hi all readers of the OTN forum,

I have a question about how LSNs are computed.

We know that elections in replication are based on the LSN and the priority.

1) Assume there are two nodes, for example, and there is a partition in the network.
2) Assume that both nodes are MASTER and data is written to both databases (without using the DB_REPMGR_CONF_2SITE_STRICT flag).

If one node (node-A) inserts a very large number of records at once, and the second node (node-B) inserts only a few records but at a later time,
can we expect that at re-election node-B will be chosen and all the records inserted by node-A will be rolled back?
Or will node-A presumably be elected because of the number of records it added?

In other words, does the computation of the next LSN depend on the time at which a record is saved?

Thanks in advance
  • 1. Re: Question on LSN computing and replication
    Paula B Explorer
    The time that log records are added to the log is not a factor in calculating next_lsn or deciding which site will win an election. We compute LSNs solely based on the amount of space each log record uses.
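To illustrate the point that LSNs depend on space and not on time, here is a minimal sketch that models an LSN as a (file number, offset) pair. The function `next_lsn`, the `max_file_size` parameter, and the rule for rolling over to a new log file are assumptions made for illustration, not Berkeley DB's internal code.

```python
from typing import Tuple

# An LSN modeled as (log file number, byte offset in that file).
LSN = Tuple[int, int]


def next_lsn(current: LSN, record_size: int, max_file_size: int) -> LSN:
    """Return the LSN just past a record of record_size bytes.

    Assumption for illustration: a record that does not fit in the
    current log file is written at offset 0 of the next file.
    """
    file_no, offset = current
    if offset + record_size > max_file_size:
        return (file_no + 1, record_size)
    return (file_no, offset + record_size)


# One large record advances the LSN further than three small ones,
# no matter when each record was written: no clock is consulted.
big = next_lsn((1, 0), 4096, 1_000_000)
small = (1, 0)
for _ in range(3):
    small = next_lsn(small, 100, 1_000_000)
```

In this model, a site that writes more log bytes ends up with a higher LSN even if another site wrote its records later.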

    The site with the most recent log records wins an election. This is determined first by master generation and then by LSN. The master generation is increased each time a new master is elected, so a lower master generation is considered an "older" master and a higher master generation is a "newer" master.

    When the partition occurred in your example, one site was already master and the other site then elected itself master. For now, let's say A was the original master and B elected itself master after the partition. So if A was at gen 15 it remains there, and B's gen will become 16 after it wins its election. When the partition is resolved, we consider B's higher gen before considering the LSNs or the priority, so B will win the election.

    There are many other examples where considering gen first is the right thing to do. If there has been an election, generally anything that happens after that election is considered more current and an "older" master shouldn't overrule it.

    But in this example of a 2 site replication group that doesn't use 2SITE_STRICT, it is more of a toss-up which site's transactions should take priority and there are no other sites to help settle the toss-up. This configuration prioritizes availability over durability and this is one of the consequences.

    Paula Bingham
    Oracle
  • 2. Re: Question on LSN computing and replication
    857786 Newbie
    Thx Paula for the reply.
    I am still unclear on the explanations below.

    You say:
    "The time that log records are added to the log is not a factor in ... deciding which site will win an election" -> Does that mean that time is not used for winning an election?

    Then you say:
    "The site with the most recent log records wins an election" -> Doesn't that contradict the first statement?

    Please advise.

    Additionally, if I understand correctly, the election winner is determined first by master generation and only then by LSN. My difficulty is seeing how that can be appropriate for network partitioning.
    Assume you have 1 active and 1 passive machine, as in our case. If the passive is unplugged from the network, there will be 2 active machines, but as you said, the unplugged machine will have a higher master generation since it just became active. When the network is restored, why is the initially passive machine (which is now active and has had no activity) selected just because of its higher gen number?
    Why is the active machine that had many recent records not taken into account in that case, so that all of its records are rolled back after the network is restored?
    Is there a way to change that?

    Our applications manipulate data that is critical to keep up to date, and we must be sure to use the most recent data when there is a failure. How can we achieve that?

    One last question: if gen is used over LSN in the election decision, does that mean that LSN is used only when the gens of the 2 sites are equal?

    Thx.

    Edited by: 854783 on Sep 8, 2011 1:58 AM
  • 3. Re: Question on LSN computing and replication
    Paula B Explorer
    I'm sorry, saying "the site with the most recent log records wins an election" was a poor choice of words. In that sentence, I should have said "the site with the highest LSN wins an election." So yes, election winner is determined first by master generation, then by highest LSN, then by site priority, then by arbitrary tiebreaker. We have to start with master generation because there are rare cases where relying solely on highest LSN is incorrect.
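The ordering described here (master generation, then highest LSN, then site priority, then an arbitrary tiebreaker) can be sketched as a sort key. The `Site` class, `election_key`, and `pick_winner` are hypothetical names for illustration, and the site name stands in for the arbitrary tiebreaker; this is not the Replication Manager's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Site:
    name: str
    gen: int                 # master generation
    lsn: Tuple[int, int]     # (file number, offset)
    priority: int


def election_key(site: Site):
    # Compared left to right: gen, then LSN, then priority.
    # The name is only a stand-in for the arbitrary tiebreaker.
    return (site.gen, site.lsn, site.priority, site.name)


def pick_winner(sites: List[Site]) -> Site:
    return max(sites, key=election_key)


# A wrote far more log data, but B has the higher generation.
a = Site("A", gen=15, lsn=(9, 5000), priority=100)
b = Site("B", gen=16, lsn=(4, 120), priority=50)
winner = pick_winner([a, b])
```

With these values B wins on generation alone; LSN and priority only matter when the earlier fields are equal.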

    Are you still asking about a 2-site replication group here? I'm not sure what you mean by active and passive. Do you mean active=master and passive=client/replica?

    When there is a network partition leading to duplicate masters in a 2-site replication group with 2SITE_STRICT off, it is quite possible that there are new transactions on both sites. In this case you are correct that our master generation algorithm favors the formerly passive site's new transactions when the partition is resolved.

    If you don't want to risk losing some transactions during a network partition, you have to use 2SITE_STRICT. In your example, the original master will continue to accept write transactions, but the isolated passive site will be read-only until the network partition is resolved.

    Paula Bingham
    Oracle
  • 4. Re: Question on LSN computing and replication
    857786 Newbie
    Thx for the reply.
    So you say that master generation is taken into account before LSN.
    That contradicts the following, pasted from the BDB documentation, right?
    "For a client to win an election, the replication group must currently have no master, and the client must have the most recent log records. In the case of clients having equivalent log records, the priority of the database environments participating in the election will determine the winner."

    In terms of configuration yes, we are using a 2-site replication group with 2SITE_STRICT off.

    You say:
    "We have to start with master generation because there are rare cases where relying solely on highest LSN is incorrect." Do you have an example of such a case?
    I guess that it will be very difficult for us to justify to our customers that after a split brain is resolved (reconnection after a network partition), the box that is not necessarily the most up to date may be selected as the master. Is there really no way to make sure that LSN is taken into account first? Do you have any arguments that we could use with our customers to justify such behavior?

    I guess that using 2SITE_STRICT is not an option, unless I misunderstand it. My understanding is that if I use that option and the master is unplugged from the network (equivalent to a network partition), then no failover will happen, which defeats the purpose of having high availability using BDB. Am I right?
  • 5. Re: Question on LSN computing and replication
    Paula B Explorer
    > It is contradicting the following pasted from the BDB documentation, right?
    > "For a client to win an election, the replication group must currently have no master, and the client must have the most recent log records. In the case of clients having equivalent log records, the priority of the database environments participating in the election will determine the winner."
    Your original question was whether time is a part of our LSN calculation and, by implication, whether the time of a transaction affects the choice of master after a network partition.

    The terminology "recent" is confusing with regard to your question. Internally, we often use "most recent LSN" to refer to the LSN with the highest numbers (e.g. [4][2345] is more "recent" than [3][4231]) and this is reflected in our documentation. But it still has nothing to do with actual time.
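As a small worked example of that comparison, an LSN written as [file][offset] can be modeled as a Python tuple, which compares field by field from the left. The tuple representation is an assumption for illustration only.

```python
# LSNs written as [file][offset] in the text, modeled as tuples.
lsn_a = (4, 2345)
lsn_b = (3, 4231)

# Tuples compare left to right, so the file number dominates and the
# offset only matters when the file numbers are equal.
higher = max(lsn_a, lsn_b)
```

Here `higher` is `(4, 2345)` even though its offset is smaller, and no timestamp takes part in the comparison.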

    The datagen modification to our election algorithm was made in BDB 5.2. It changed our internal definition of "most recent LSN" to be "highest datagen and then highest LSN numbers". We will take an action to consider a documentation improvement to clarify this in a future release.
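The effect of that change on the two-site split brain discussed in this thread can be sketched by comparing the two orderings side by side. The function `split_brain_winner`, the `gen_first` flag, and the sample values are hypothetical, and site priority and the final tiebreaker are left out for brevity.

```python
from typing import Dict, Tuple

# name -> (generation, LSN as (file number, offset))
SiteState = Tuple[int, Tuple[int, int]]


def split_brain_winner(sites: Dict[str, SiteState], gen_first: bool) -> str:
    """Pick the winner under one of the two orderings.

    gen_first=True  models "highest gen, then highest LSN".
    gen_first=False models "highest LSN" alone.
    """
    if gen_first:
        return max(sites, key=lambda name: sites[name])
    return max(sites, key=lambda name: sites[name][1])


# A stayed at gen 15 and wrote a lot; B became master at gen 16
# and wrote very little.
sites = {"A": (15, (9, 5000)), "B": (16, (4, 120))}
winner_gen_first = split_brain_winner(sites, gen_first=True)
winner_lsn_only = split_brain_winner(sites, gen_first=False)
```

Under the generation-first ordering B wins, while under the LSN-only ordering A wins, which is the difference an upgrade would expose.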

    If you are using an earlier version of BDB, then you will not see this modification, but instead will have the election winner determined first by highest LSN numbers. But you must be aware of this change if you intend to upgrade to 5.2 in the future.
    > "We have to start with master generation because there are rare cases where relying solely on highest LSN is incorrect." Do you have an example of such case.
    The rare case I was originally thinking of concerns internal details of which log records should contribute to deciding the winner of an election. But there is also a much more common case.

    We have the concept of acknowledgment policies and permanent transactions to handle txn durability. If you want txn durability across an election, you must choose an ack policy that requires at least a majority of sites to acknowledge a txn. This is because elections are always majority-based. Note that we define a majority as (nsites/2)+1.
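The majority formula works out as follows for small groups. The function name `majority` is chosen for illustration, and integer division is assumed for nsites/2.

```python
def majority(nsites: int) -> int:
    """Majority as defined above: (nsites / 2) + 1, using integer division."""
    return nsites // 2 + 1


# Majority needed for a few group sizes.
table = {n: majority(n) for n in (2, 3, 4, 5)}
```

For a 2-site group the majority is 2, so a single surviving site can never form a majority on its own.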

    Let me explain the issue with a 3-site example first.

    If we have original master A and clients B and C, and then A is isolated and B/C elect B as a separate master, during this partition B's txns can be acknowledged as permanent by a majority of sites in the group but A's txns cannot. We have made a guarantee about B's txns that we have not made about A's txns. When the partition is resolved, A's txns were never acknowledged as permanent and it is not incorrect to roll them back. It would be incorrect to acknowledge B's txns as permanent and then roll them back.
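The 3-site case can be checked with a short sketch. It assumes an acknowledgment policy that requires a majority of the whole group, and the names `can_ack_permanent` and `partitions` are hypothetical.

```python
def can_ack_permanent(sites_in_partition: int, nsites: int) -> bool:
    """True if the sites in this partition form a majority of the group."""
    return sites_in_partition >= nsites // 2 + 1


NSITES = 3
# After the partition: A is alone, B and C can still reach each other.
partitions = {"A": 1, "B+C": 2}
durable = {
    name: can_ack_permanent(size, NSITES)
    for name, size in partitions.items()
}
```

Only the B+C side can acknowledge txns as permanent, which is why rolling back A's txns breaks no guarantee while rolling back B's would.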

    The 3-site example illustrates the general reason for this change. However, a 2-site replication group presents a special problem. This is explained later in the same Elections section of our Programmer's Reference that you quoted:
    > Note that this presents a special problem for a replication group consisting of only two environments. If a master site fails, the remaining client can never comprise a majority of sites in the group. If the client application can reach a remote network site, or some other external tie-breaker, it may be able to determine whether it is safe to declare itself master. Otherwise it must choose between providing availability of a writable master (at the risk of duplicate masters), or strict protection against duplicate masters (but no master when a failure occurs). Replication Manager offers this choice via the DB_ENV->rep_set_config() method.
    This is precisely the trade-off you need to make. The only other option is to add one or more sites to your replication group, if that is possible. You would then be able to elect a new master and durably replicate txns even with the loss of one site.

    Your understanding of the 2SITE_STRICT option is correct.

    Paula Bingham
    Oracle
  • 6. Re: Question on LSN computing and replication
    857786 Newbie
    Thx Paula. That is really helpful.
    So I guess that in our case datagen is not taken into account, since we are using version 5.1.19. Therefore I would expect that only the machine with the highest LSN will be selected. I don't fully understand why you say that highest LSN has nothing to do with "recent" in terms of time.
    Is the following logic correct?
    Assume I have a cluster with only 2 machines (2SITE_STRICT off). When the cluster is connected and stable, I expect the most recent LSN to be the same on both machines, since the client inherits it from the master. Now if a split brain happens (say, by disconnecting the cable of the client machine) and only one machine of the cluster stays connected to the external network, I would expect the following to happen every time:
    1. There will be 2 masters after the split brain.
    2. Only the LSN of one of them (the original master in this scenario) will grow, since only that one is still receiving activity from the external world, which triggers writes to DB tables.
    3. Hence, because of 2., when you reconnect the cluster the original master will consistently be selected.

    Is the above correct, assuming I am using 5.1.19? Thx.

    As for the 3-site case you mention, even though I agree with the statement "It would be incorrect to acknowledge B's txns as permanent and then roll them back.", I am not sure this is the right thing from a business perspective, for the same reason I mentioned at the beginning of this thread: you could have 2 permanent transactions on B/C and 1 million much more recent transactions on A, and you are going to roll them back. Is there a way to avoid that, at least in my case with a 2-site cluster (without adding another site, since the number of boxes is a concern for our customers)?
  • 7. Re: Question on LSN computing and replication
    Paula B Explorer
    > I don't fully understand why you say that highest LSN has nothing to do with "recent" in terms of time.
    LSNs get incremented when there is further log activity, regardless of how much actual time has passed since the last log update. So although LSNs increase as time goes by, you can't assume a particular amount of time has passed. And we do not have a way to relate a particular txn LSN to a particular time.
    > Assume I have a cluster with only 2 machines (2SITE_STRICT off). When the cluster is connected and stable, I expect the most recent LSN to be the same on both machines, since the client inherits it from the master. Now if a split brain happens (say, by disconnecting the cable of the client machine) and only one machine of the cluster stays connected to the external network, I would expect the following to happen every time:
    > 1. There will be 2 masters after the split brain.
    > 2. Only the LSN of one of them (the original master in this scenario) will grow, since only that one is still receiving activity from the external world, which triggers writes to DB tables.
    > 3. Hence, because of 2., when you reconnect the cluster the original master will consistently be selected.
    > Is the above correct, assuming I am using 5.1.19? Thx.
    Yes, in 5.1.19 this is the expected behavior.
    > As for the 3-site case you mention, even though I agree with the statement "It would be incorrect to acknowledge B's txns as permanent and then roll them back.", I am not sure this is the right thing from a business perspective, for the same reason I mentioned at the beginning of this thread: you could have 2 permanent transactions on B/C and 1 million much more recent transactions on A, and you are going to roll them back. Is there a way to avoid that, at least in my case with a 2-site cluster (without adding another site, since the number of boxes is a concern for our customers)?
    The only options available are the ones I have mentioned. When you have a 2-site replication group, you get durability of committed txns if you use 2SITE_STRICT. If you don't use 2SITE_STRICT, you increase master availability at the risk of duplicate masters and some txn rollback.
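The trade-off can be summarized in a small decision sketch. It models the behavior described in this thread, not the Replication Manager's code; `isolated_site_may_become_master` and its parameters are hypothetical names, and `strict_2site` stands for the DB_REPMGR_CONF_2SITE_STRICT setting.

```python
def isolated_site_may_become_master(
    nsites: int, reachable: int, strict_2site: bool
) -> bool:
    """Model of whether a site that lost its master may elect itself.

    reachable counts the sites this site can still talk to,
    including itself.
    """
    if nsites == 2 and not strict_2site:
        # Availability first: the lone site may take over, at the
        # risk of duplicate masters and later txn rollback.
        return True
    # Otherwise a majority of the whole group is required.
    return reachable >= nsites // 2 + 1


lenient = isolated_site_may_become_master(2, 1, strict_2site=False)
strict = isolated_site_may_become_master(2, 1, strict_2site=True)
three_sites = isolated_site_may_become_master(3, 2, strict_2site=True)
```

In this model a lone site in a strict 2-site group never becomes master, while two surviving sites of a 3-site group still can.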

    Paula Bingham
    Oracle
  • 8. Re: Question on LSN computing and replication
    857786 Newbie
    Thx a lot for the answers and patience on this thread.
