This discussion is archived
9 Replies. Latest reply: Jun 6, 2011 12:45 PM by Paula B

Two master machines after reconnection

857786 Newbie
We are using Berkeley DB 5.1.19 with the standard replication manager.
We are testing the supplied sample "excxx_repquote" in a two-machine configuration (one master, one client)
with the following settings:

- ACK policy: quorum
- nsites = 1
- priority = 100
- bulk = 1
- verbose on
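
For reference, this configuration corresponds roughly to the following calls (a simplified sketch using the BDB 5.1 C API; the example itself uses the C++ classes, and the setup_repmgr() wrapper and reduced event callback here are illustrative only):

#include <stdio.h>
#include <db.h>

/* Simplified event callback: we track MASTER/CLIENT state from these events. */
static void
event_callback(DB_ENV *env, u_int32_t which, void *info)
{
    (void)env; (void)info;
    switch (which) {
    case DB_EVENT_REP_MASTER:
        printf("this site is now MASTER\n");
        break;
    case DB_EVENT_REP_CLIENT:
        printf("this site is now CLIENT\n");
        break;
    default:
        break;
    }
}

/* Sketch of the settings listed above, as run on the 2.0.0.110 node. */
int
setup_repmgr(DB_ENV *env)
{
    int eid, ret;

    env->set_event_notify(env, event_callback);
    env->repmgr_set_ack_policy(env, DB_REPMGR_ACKS_QUORUM); /* ACK quorum */
    env->rep_set_nsites(env, 1);                       /* nsites = 1 */
    env->rep_set_priority(env, 100);                   /* priority = 100 */
    env->rep_set_config(env, DB_REP_CONF_BULK, 1);     /* bulk = 1 */
    env->set_verbose(env, DB_VERB_REPLICATION, 1);     /* verbose on */

    env->repmgr_set_local_site(env, "2.0.0.110", 12345, 0);
    env->repmgr_add_remote_site(env, "2.0.0.210", 12345, &eid, 0);

    if ((ret = env->repmgr_start(env, 3, DB_REP_ELECTION)) != 0)
        env->err(env, ret, "repmgr_start");
    return (ret);
}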

We then disconnect the cable between the two machines and check each machine's state (MASTER/CLIENT) via BDB events.

We notice that after the disconnection both machines become MASTER, as expected. After reconnecting the cable, however,
sometimes they both stay MASTER without ever running an election, and sometimes it works as expected (one becomes CLIENT).

If we update the table on one of the machines, an election is triggered automatically and finishes correctly (one MASTER, one CLIENT).

Since in our solution the database is not updated frequently, we want to know whether there is another way to ensure the sites reach the correct state
without developing our own replication manager on top of the BDB Base API.



Here is a sample session in which both machines stay MASTER:

LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExample -h /opt/bdb/ -l 2.0.0.110:12345 -r 2.0.0.210:12345 -a quorum -b -n 1

[Node 1 - 2.0.0.210:12345]

[1303813200:224828][6870/1114081600] excxx_repquote: init connection to site 2.0.0.110:12345 with result 115
[1303813201:226494][6870/1114081600] excxx_repquote: handshake from connection to 2.0.0.110:12345 EID 0
[1303813246:138904][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11350]
[1303813306:199920][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11350]
[1303813366:259927][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11350]
[1303813426:319924][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11350]
[1303813486:381340][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11350]
[1303813546:441939][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11350]
[1303813606:504223][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11350]


[Node 2 - 2.0.0.110:12345]
[1303817756:737444][28294/1134483776] excxx_repquote: accepted a new connection
[1303817756:737753][28294/1134483776] excxx_repquote: connection from 2.0.0.210:12345 EID 0 supersedes existing
[1303817785:985872][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303817846:51545][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303817906:118202][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303817966:181879][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303818026:247542][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303818086:311201][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303818146:374885][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303818206:439538][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]
[1303818266:501207][28294/1123993920] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][10654]




Here is a sample session that works as expected:


LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExample -h /opt/bdb/ -l 2.0.0.110:12345 -r 2.0.0.210:12345 -a quorum -b -n 1

[Node 1 - 2.0.0.210:12345]


[1303813837:576838][9750/1085184320] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type start_sync, LSN [1][13418] nobuf
[1303813837:576970][9750/1102199104] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11442]
[1303813837:602837][9750/1112688960] excxx_repquote: init connection to site 2.0.0.110:12345 with result 115
[1303813837:603267][9750/1112688960] excxx_repquote: handshake from connection to 2.0.0.110:12345 EID 0
[1303813837:719943][9750/1085184320] excxx_repquote: bulk_msg: Send buffer after copy due to PERM
[1303813837:719961][9750/1085184320] excxx_repquote: send_bulk: Send 252 (0xfc) bulk buffer bytes
[1303813837:719969][9750/1085184320] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type bulk_log, LSN [1][13510] flush perm
[1303813837:719992][9750/1085184320] excxx_repquote: will await acknowledgement: need 1
[1303813837:720330][9750/1154648384] excxx_repquote: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 26 eid 0, type dupmaster, LSN [0][0]
[1303813837:759986][9750/1144158528] excxx_repquote: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 26 eid 0, type newclient, LSN [0][0]
[1303813837:760054][9750/1144158528] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type newsite, LSN [0][0] nobuf
[1303813837:760090][9750/1144158528] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type newmaster, LSN [1][13566] nobuf
[1303813837:760123][9750/1144158528] excxx_repquote: NEWSITE info added site 2.0.0.110:12345
[1303813837:760142][9750/1144158528] excxx_repquote: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 28 eid 0, type newmaster, LSN [1][12926]
[1303813837:760155][9750/1144158528] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type dupmaster, LSN [0][0] nobuf
[1303813837:760425][9750/1133668672] excxx_repquote: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 26 eid 0, type vote1, LSN [1][12926]
[1303813837:760525][9750/1133668672] excxx_repquote: Master received vote
[1303813837:760540][9750/1133668672] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type newmaster, LSN [1][13566] nobuf
[1303813837:760567][9750/1133668672] excxx_repquote: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 28 eid 0, type bulk_log, LSN [1][13302] perm
[1303813837:760575][9750/1133668672] excxx_repquote: Client record received on master
[1303813837:760585][9750/1133668672] excxx_repquote: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 28 eid 0, type bulk_log, LSN [1][13302] perm
[1303813837:760592][9750/1133668672] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type dupmaster, LSN [0][0] nobuf
Tue Apr 26 06:30:37 2011 - DB_EVENT_REP_PERM_FAILED.
Tue Apr 26 06:30:37 2011 - Insufficient acknowledgements to guarantee transaction durability.
[1303813837:771840][9750/1085184320] excxx_repquote: rep_send_function returned: 110
[1303813838:822994][9750/1154648384] excxx_repquote: rep_start: Found old version log 17
[1303813838:823207][9750/1154648384] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type newclient, LSN [0][0] nobuf
Tue Apr 26 06:30:38 2011 - DB_EVENT_REP_CLIENT.



[Node 2 - 2.0.0.110:12345]

[1303818383:768511][1697/1116809536] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn[1][11442]
[1303818383:775392][1697/1106319680] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 26 eid -1, type start_sync, LSN [1][12778] nobuf
[1303818383:775924][1697/1106319680] excxx_repquote: bulk_msg: Send buffer after copy due to PERM
[1303818383:775945][1697/1106319680] excxx_repquote: send_bulk: Send 252 (0xfc) bulk buffer bytes
[1303818383:775953][1697/1106319680] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 26 eid -1, type bulk_log, LSN [1][12870] flush perm
[1303818383:775975][1697/1106319680] excxx_repquote: will await acknowledgement: need 1
Tue Apr 26 14:46:23 2011 - DB_EVENT_REP_PERM_FAILED.
Tue Apr 26 14:46:23 2011 - Insufficient acknowledgements to guarantee transaction durability.
[1303818383:828424][1697/1106319680] excxx_repquote: rep_send_function returned: 110
excxx_repquote: can't read from site 2.0.0.210:12345: Connection reset by peer
[1303818385:199669][1697/1127299392] excxx_repquote: Repmgr: bust connection. Block archive
[1303818393:152555][1697/1127299392] excxx_repquote: accepted a new connection
[1303818393:152902][1697/1127299392] excxx_repquote: handshake from idle site 2.0.0.210:12345 EID 0
[1303818393:269644][1697/1158768960] excxx_repquote: /opt/bdb/ rep_process_message: msgv = 5 logv 17 gen = 28 eid 0, type bulk_log, LSN [1][13510] flush perm
[1303818393:269671][1697/1158768960] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 26 eid -1, type dupmaster, LSN [0][0] nobuf
[1303818393:269882][1697/1158768960] excxx_repquote: rep_start: Found old version log 17
[1303818393:270050][1697/1158768960] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 26 eid -1, type newclient, LSN [0][0] nobuf
Tue Apr 26 14:46:33 2011 - DB_EVENT_REP_CLIENT.
excxx_repquote: ignoring event 4
[1303818393:270161][1697/1179748672] excxx_repquote: starting election thread
[1303818393:270209][1697/1179748672] excxx_repquote: Start election nsites 1, ack 1, priority 100
[1303818393:270223][1697/1179748672] excxx_repquote: Election thread owns egen 27
[1303818393:272244][1697/1179748672] excxx_repquote: Tallying VOTE1[0] (2147483647, 27)
[1303818393:272270][1697/1179748672] excxx_repquote: Beginning an election
[1303818393:272284][1697/1179748672] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 26 eid -1, type vote1, LSN [1][12926] nobuf
[1303818393:272310][1697/1179748672] excxx_repquote: Tallying VOTE2[0] (2147483647, 27)
[1303818393:272320][1697/1179748672] excxx_repquote: Counted my vote 1
[1303818393:272328][1697/1179748672] excxx_repquote: Skipping phase2 wait: already got 1 votes
[1303818393:272337][1697/1179748672] excxx_repquote: Got enough votes to win; election done; (prev) gen 26
[1303818393:272348][1697/1179748672] excxx_repquote: Election finished in 0.002117000 sec
[1303818393:272358][1697/1179748672] excxx_repquote: Election done; egen 28
excxx_repquote: ignoring event 5
[1303818393:272386][1697/1179748672] excxx_repquote: Ended election with 0, e_th 0, egen 28, flag 0x2a2c, e_fl 0x0, lo_fl 0x4
[1303818393:272411][1697/1179748672] excxx_repquote: Election done; egen 28
[1303818393:272422][1697/1179748672] excxx_repquote: New master gen 28, egen 29
[1303818393:272829][1697/1179748672] excxx_repquote: rep_start: Old log version was 17
[1303818393:272836][1697/1179748672] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type newmaster, LSN [1][12926] nobuf
[1303818393:272854][1697/1179748672] excxx_repquote: restore_prep: No prepares. Skip.
[1303818393:273175][1697/1179748672] excxx_repquote: bulk_msg: Send buffer after copy due to PERM
[1303818393:273184][1697/1179748672] excxx_repquote: send_bulk: Send 480 (0x1e0) bulk buffer bytes
[1303818393:273190][1697/1179748672] excxx_repquote: /opt/bdb/ rep_send_message: msgv = 5 logv 17 gen = 28 eid -1, type bulk_log, LSN [1][13302] perm
Tue Apr 26 14:46:33 2011 - DB_EVENT_REP_MASTER.
Tue Apr 26 14:46:33 2011 - DB_EVENT_REP_MASTER.
---------------------------------------------------------------------------------------------------------------------------------------



Thanks in advance for any ideas you can propose.
  • 1. Re: Two master machines after reconnection
    Paula B Explorer
    In this particular example, your replication group consists of two sites (2.0.0.210:12345 and 2.0.0.110:12345). This means that your nsites value should be 2 on both sites. You should certainly make this adjustment, but I don't think it will, by itself, change this behavior.

    You can prevent this behavior by setting the DB_REPMGR_CONF_2SITE_STRICT flag on both sites. This would prevent your original client from becoming a duplicate master when the connection is broken. We don't use this flag in excxx_repquote because it is a simple example, but you can modify the example to try this if you wish.

    If you don't use DB_REPMGR_CONF_2SITE_STRICT, the sites can diverge while they are disconnected because there is nothing to stop you from doing updates on both sites. Once the sites are reconnected, the duplicate master situation is detected as a result of the underlying replication messages from the first update attempt at either site and that causes an election. I don't believe there is any other more automatic way to detect the duplicate master.

    Without 2SITE_STRICT, the divergent transactions at the site that loses the election could be rolled back. If avoiding such rollbacks is important to you, you should use 2SITE_STRICT.
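
    For illustration, a minimal sketch of enabling the flag with the C API (set it on both sites before calling repmgr_start(); the enable_2site_strict() wrapper is illustrative, not part of the library):

    #include <db.h>

    /* Sketch: enable strict two-site behavior on this environment.
     * With the flag set, a site in a two-site group will not declare
     * itself master while its peer is unreachable. */
    int
    enable_2site_strict(DB_ENV *env)
    {
        return (env->rep_set_config(env, DB_REPMGR_CONF_2SITE_STRICT, 1));
    }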

    Paula Bingham
    Oracle
  • 2. Re: Two master machines after reconnection
    857786 Newbie
    Thanks Paula!

    If I understand correctly, this option prevents the client from becoming master when the original master fails or becomes disconnected.

    In our solution we use a two-node configuration with a master and a client that is supposed to take over when the master goes down.

    Your suggestion prevents the two machines from both becoming master after a disconnection.
    I'm not worried about that case; my problem is why the two machines remain MASTER after the connection comes back.
  • 3. Re: Two master machines after reconnection
    Paula B Explorer
    We can only detect that there are two masters when we process replication messages from write activity on one of the sites. Once we detect that there are two masters, we automatically call the election that resolves this situation.

    Why is it important to you to resolve the two masters before the first write operation attempt on one of them?

    My suggestion of 2SITE_STRICT is a way to prevent this situation at the cost of replication group availability for write operations when the master is unavailable. You have indicated that you need this availability.

    Your other option is a workaround - having your application perform a forced checkpoint or a "dummy" write operation periodically to trigger the underlying replication messages that will detect the duplicate masters. As a matter of fact, excxx_repquote already has a checkpoint thread that performs an unforced checkpoint every 60 seconds. You can change this to perform the checkpoint with the DB_FORCE flag and perhaps more frequently if you would like to experiment with this approach.
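
    As a rough sketch of that workaround (assuming a POSIX-style worker thread; the forced_checkpoint_thread() name is illustrative, and the 60-second interval mirrors the example's existing checkpoint thread):

    #include <unistd.h>
    #include <db.h>

    /* Sketch of a periodic forced checkpoint (cf. the checkpoint thread
     * in excxx_repquote). DB_FORCE writes a checkpoint record even when
     * there has been no update activity, generating the replication
     * messages that allow a duplicate master to be detected. */
    void *
    forced_checkpoint_thread(void *arg)
    {
        DB_ENV *env = (DB_ENV *)arg;
        int ret;

        for (;;) {
            if ((ret = env->txn_checkpoint(env, 0, 0, DB_FORCE)) != 0)
                env->err(env, ret, "txn_checkpoint");
            sleep(60);  /* reduce the interval to experiment */
        }
        return (NULL);
    }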

    Paula Bingham
    Oracle
  • 4. Re: Two master machines after reconnection
    Paula B Explorer
    To follow up, I wanted to let you know that we were able to add something to repmgr to detect duplicate masters in our next release. Thank you for reporting this!

    Paula Bingham
    Oracle
  • 5. Re: Two master machines after reconnection
    857786 Newbie
    In which release will this be included, and when is that release scheduled?
  • 6. Re: Two master machines after reconnection
    Paula B Explorer
    We don't publish projected release dates, but here is a post that provides as much of an answer as possible to your question:

    Re: Release roadmap

    Paula Bingham
    Oracle
  • 7. Re: Two master machines after reconnection
    857786 Newbie
    Hi Paula,
    Following your instructions, we tested the different options:

    - A dummy write to the database on both nodes: this works well, and after a short time one of the MASTER sites returns to CLIENT as we expected.
    However, this action disturbs the LSN mechanism: after a network reconnection, the database chosen as MASTER is not always the most up to date, and we can lose important data during the re-sync step.

    - A dummy write in only one process: in a large number of cases the process cannot detect that the other side has reconnected; it never enters an election and stays in MASTER-MASTER indefinitely, as we described in the first part of this discussion.

    - A checkpoint with the DB_FORCE flag works even when we run it on only one node, but I'm afraid it will degrade transaction write performance: running a forced checkpoint every second, for example, adds a lot of overhead in the DB and hurts write throughput at high rates. We also saw the same problem of the LSN mechanism being disturbed, so that after reconnection the database chosen is not always the most up to date.


    So we are stuck in our application...
    How can I ensure that a split-brain between the two nodes always ends with an election that chooses the most up-to-date database as MASTER, without injecting dummy writes or checkpoints that disturb the most up-to-date database?

    Thanks in advance...
  • 8. Re: Two master machines after reconnection
    857786 Newbie
    I am reopening the thread for some new questions, added following the responses above.
  • 9. Re: Two master machines after reconnection
    Paula B Explorer
    Some internal discussion about this reminded me of a much lower-overhead alternative to forcing a checkpoint. We have a rep_flush() API that simply rebroadcasts the last log record. This call will not increase the LSN or cause I/O. The rep_flush() call is not documented, but it is accessible to API users and its specification is:

    int
    DB_ENV->rep_flush(DB_ENV *env); 
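
    For example (a sketch only; the rep_flush_thread() wrapper and 60-second interval are illustrative, not part of the API), your application could call it periodically from a worker thread much like the example's checkpoint thread:

    #include <unistd.h>
    #include <db.h>

    /* Sketch: rebroadcast the last log record periodically via rep_flush()
     * instead of forcing checkpoints. This produces the message traffic
     * needed to detect a duplicate master without advancing the LSN or
     * causing I/O. */
    void *
    rep_flush_thread(void *arg)
    {
        DB_ENV *env = (DB_ENV *)arg;
        int ret;

        for (;;) {
            if ((ret = env->rep_flush(env)) != 0)
                env->err(env, ret, "rep_flush");
            sleep(60);  /* choose an interval that suits your application */
        }
        return (NULL);
    }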

    The only other suggestion we have is to add a third electable site to your replication group. This would insulate you from the single point of failure problems you are having with your two-site replication group.

    A two-site replication group is a special case for our replication implementation. If you have a single point of failure (one of the sites) you do not have a functioning replication group because you only have one site left. As you have seen, if the remaining site is not ready for replication (e.g. still in client sync) or goes down, you lose the replication group and possibly some data.

    If you add a third electable site to your replication group, then when one site is unavailable you still have a functioning replication group that can hold elections and maintain more than one copy of the replicated data.

    If you are absolutely confined to two sites, these are your alternatives:

    1. Use 2SITE_STRICT and reduce your replication group availability to guarantee no loss of data.

    2. Do not use 2SITE_STRICT and risk the loss of some data when a site rejoins the replication group.

    3. If you do not use 2SITE_STRICT and you want detection of DUPMASTER (split-brain) without application activity, you will have to provide your own workaround in the form of a periodic additional replication call. Hopefully, the rep_flush() call I mentioned above will be lower overhead for you.

    Paula Bingham
    Oracle
