We are using Berkeley DB 5.1.19 with the standard replication manager.
We tested the provided sample "excxx_repquote" in a two-machine configuration (one master and one client)
with the following configuration.
We disconnect the cable between the two machines and check each machine's state (MASTER/CLIENT) according to the BDB events.
We notice that after the disconnect both machines become MASTER, as expected, but after reconnecting the cable
they sometimes remain MASTER without entering an election, and sometimes it works as expected (one becomes CLIENT).
If we update the table on one of the machines, they automatically enter an election and finish correctly (one MASTER, one CLIENT).
Since the database in our solution is not updated frequently, we would like to know whether there is another way to ensure the nodes reach the correct state
without developing a replication manager based on the BDB Base API.
Here is a sample session in which both machines stay master:
LD_LIBRARY_PATH=/usr/local/BerkeleyDB.5.1/lib/ /root/RepQuoteExample -h /opt/bdb/ -l 126.96.36.199:12345 -r 188.8.131.52:12345 -a quorum -b -n 1
[Node 1 - 184.108.40.206:12345]
[1303813200:224828][6870/1114081600] excxx_repquote: init connection to site 220.127.116.11:12345 with result 115
[1303813201:226494][6870/1114081600] excxx_repquote: handshake from connection to 18.104.22.168:12345 EID 0
[1303813246:138904][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn
[1303813306:199920][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn
[1303813366:259927][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn
[1303813426:319924][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn
[1303813486:381340][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn
[1303813546:441939][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn
[1303813606:504223][6870/1103591744] excxx_repquote: Repmgr_stable_lsn: Returning stable_lsn
In this particular example, your replication group consists of two sites (22.214.171.124:12345 and 126.96.36.199:12345). This means that your nsites value should be 2 on both sites. You should certainly make this adjustment, but I don't think it will change this behavior by itself.
You can prevent this behavior by setting the DB_REPMGR_CONF_2SITE_STRICT flag on both sites. This would prevent your original client from becoming a duplicate master when the connection is broken. We don't use this flag in excxx_repquote because it is a simple example, but you can modify the example to try this if you wish.
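As a minimal sketch of how those two settings could be applied before starting the replication manager (the function name and error handling here are illustrative assumptions, not part of the sample):

```c
#include <db.h>

/* Sketch: configure a two-site repmgr environment with
 * DB_REPMGR_CONF_2SITE_STRICT, as suggested above. Call this
 * after db_env_create() and before repmgr_start(). */
int configure_two_site_group(DB_ENV *dbenv)
{
    int ret;

    /* The replication group has exactly two electable sites. */
    if ((ret = dbenv->rep_set_nsites(dbenv, 2)) != 0)
        return ret;

    /* With 2SITE_STRICT on, a client in a two-site group needs a
     * vote from the other site to win an election, so it will not
     * promote itself to master while the connection is down. */
    if ((ret = dbenv->rep_set_config(dbenv,
            DB_REPMGR_CONF_2SITE_STRICT, 1)) != 0)
        return ret;

    return 0;
}
```

The same calls are available through the C++ DbEnv handle used by excxx_repquote.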
If you don't use DB_REPMGR_CONF_2SITE_STRICT, the sites can diverge while they are disconnected because there is nothing to stop you from doing updates on both sites. Once the sites are reconnected, the duplicate master situation is detected as a result of the underlying replication messages from the first update attempt at either site and that causes an election. I don't believe there is any other more automatic way to detect the duplicate master.
Without 2SITE_STRICT, the divergent transactions at the site that loses the election could be rolled back. If avoiding such rollbacks is important to you, you should use 2SITE_STRICT.
If I understand correctly, this option prevents the client from becoming master when the original master fails or is disconnected.
In our solution we use a two-node configuration with a master and a client that is supposed to take over when the master goes down.
Your suggestion prevents both machines from being master after a disconnect.
I'm not worried about that case; my problem is why the two machines remain master after the connection comes back.
We can only detect that there are two masters when we process replication messages from write activity on one of the sites. Once we detect that there are two masters, we automatically call the election that resolves this situation.
Why is it important to you to resolve the two masters before the first write operation attempt on one of them?
My suggestion of 2SITE_STRICT is a way to prevent this situation at the cost of replication group availability for write operations when the master is unavailable. You have indicated that you need this availability.
Your other option is a workaround - having your application perform a forced checkpoint or a "dummy" write operation periodically to trigger the underlying replication messages that will detect the duplicate masters. As a matter of fact, excxx_repquote already has a checkpoint thread that performs an unforced checkpoint every 60 seconds. You can change this to perform the checkpoint with the DB_FORCE flag and perhaps more frequently if you would like to experiment with this approach.
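A sketch of that modification to the checkpoint thread (the 60-second interval comes from the example; the thread body and argument passing here are assumptions):

```c
#include <unistd.h>
#include <db.h>

/* Sketch: checkpoint thread body modified to use DB_FORCE.
 * DB_FORCE writes a checkpoint record even when there has been no
 * update activity, which generates the underlying replication
 * messages that can expose a duplicate master. */
void *checkpoint_thread(void *arg)
{
    DB_ENV *dbenv = (DB_ENV *)arg;

    for (;;) {
        int ret = dbenv->txn_checkpoint(dbenv, 0, 0, DB_FORCE);
        if (ret != 0)
            dbenv->err(dbenv, ret, "txn_checkpoint");
        sleep(60); /* shorten to experiment with faster detection */
    }
    return NULL;
}
```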
Following your instructions, we tested different options:
- A dummy write to the database on both nodes: the process works well, and after a short time one of the MASTER sites returns to being CLIENT, as we expected.
However, this action disturbs the LSN mechanism: after network reconnection the database chosen as MASTER is not always the most up to date, and we can lose important data during the re-sync step.
- A dummy write in only one process: in a large number of cases the process cannot detect that the other side has reconnected, never enters an election, and stays in MASTER-MASTER indefinitely, as we described in the first part of the discussion.
- Performing a checkpoint with the DB_FORCE flag works even when we run it on only one node, but I'm afraid it will degrade write performance: forcing a checkpoint every second, for example, causes a lot of overhead in the DB and would degrade write throughput at high rates. We also saw the same LSN problem: after reconnection the database chosen is not always the most up to date.
So we are stuck in our application...
How can we ensure that a split-brain between the two nodes always ends with an election that chooses the most up-to-date database as MASTER, without artificial writes or checkpoints that disturb the most up-to-date database?
Some internal discussion about this reminded me of a much lower-overhead alternative to forcing a checkpoint. We have a rep_flush() API that simply rebroadcasts the last log record. This call does not increase the LSN or cause I/O. The rep_flush() call is not documented, but it is accessible to API users and its specification is:
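Based on that description, a periodic call could be sketched as follows. Since rep_flush() is undocumented in 5.1, its exact behavior should be verified against your build; the loop structure and the 10-second interval are arbitrary choices for illustration:

```c
#include <unistd.h>
#include <db.h>

/* Sketch: periodically rebroadcast the last log record so that a
 * reconnected duplicate master is detected without performing any
 * application writes. rep_flush() does not advance the LSN. */
void rep_flush_loop(DB_ENV *dbenv)
{
    for (;;) {
        int ret = dbenv->rep_flush(dbenv);
        if (ret != 0)
            dbenv->err(dbenv, ret, "rep_flush");
        sleep(10); /* arbitrary interval */
    }
}
```

This loop could run in its own thread alongside the existing checkpoint thread in excxx_repquote.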
The only other suggestion we have is to add a third electable site to your replication group. This would insulate you from the single point of failure problems you are having with your two-site replication group.
A two-site replication group is a special case for our replication implementation. If you have a single point of failure (one of the sites) you do not have a functioning replication group because you only have one site left. As you have seen, if the remaining site is not ready for replication (e.g. still in client sync) or goes down, you lose the replication group and possibly some data.
If you add a third electable site to your replication group, then when one site is unavailable you still have a functioning replication group that can hold elections and maintain more than one copy of the replicated data.
If you are absolutely confined to two sites, these are your alternatives:
1. Use 2SITE_STRICT and reduce your replication group availability to guarantee no loss of data.
2. Do not use 2SITE_STRICT and risk the loss of some data when a site rejoins the replication group.
3. If you do not use 2SITE_STRICT and you want detection of DUPMASTER (split-brain) without application activity, you will have to provide your own workaround in the form of a periodic additional replication call. Hopefully, the rep_flush() call I mentioned above will be lower overhead for you.