I have a question about the replication process. What happens when we have a database replicated across 2 sites,
the master fails, and at the same time the client has not finished its sync and has to become the new MASTER?
The BDB log shows something like "this site is not electable",
but in the BDB API we did not receive any event such as a failure or a timeout.
I think that in this case we should wait until the master restarts, and if it never restarts we should restore from
a previous backup made at the slave.
My question is: how can I detect this situation through the BDB API?
This is the log:
[1306332357:593964][18361/1594886464] CLIENT: PAGE: Write page 2502 into mpool
[1306332357:593983][18361/1594886464] CLIENT: PAGE_GAP: pgno 2502, max_pg 20106 ready 2502, waiting 2503 max_wait 0
[1306332357:594003][18361/1594886464] CLIENT: PAGE_GAP: Set cursor for ready 2503, waiting 2503
[1306332357:594015][18361/1594886464] CLIENT: PAGE_GAP: Next cursor No next - ready 2504, waiting 0
[1306332357:594024][18361/1594886464] CLIENT: FILEDONE: have 2504 pages. Need 20107.
[1306332357:594031][18361/1594886464] CLIENT: rep_bulk_page: rep_page ret 0 No electable site found: recvd 1 of 1 votes from 1 sites
[1306332359:589408][18361/1615866176] CLIENT: Election finished in 2.015226000 sec
[1306332359:589439][18361/1615866176] CLIENT: Election done; egen 169
[1306332359:589450][18361/1615866176] CLIENT: Ended election with -30974, e_th 0, egen 169, flag 0x202c, e_fl 0x0, lo_fl 0x11
[1306332369:592356][18361/1615866176] CLIENT: /usr/local/sandvine/replica_data rep_send_message: msgv = 5 logv 17 gen = 167 eid -1, type newclient, LSN  nobuf
DB_ENV->rep_elect:WARNING: nvotes (1) is sub-majority with nsites (2)
[1306332370:594377][18361/1615866176] CLIENT: Start election nsites 2, ack 1, priority 100
[1306332370:594407][18361/1615866176] CLIENT: Election thread owns egen 169
*[1306332370:594421][18361/1615866176] CLIENT: Setting priority 0, unelectable, due to internal init/recovery*
[1306332370:596625][18361/1615866176] CLIENT: Tallying VOTE1 (2147483647, 169)
[1306332370:596640][18361/1615866176] CLIENT: Beginning an election
[1306332370:596655][18361/1615866176] CLIENT: /usr/local/bd/replica_data rep_send_message: msgv = 5 logv 17 gen = 167 eid -1, type vote1, LSN  nobuf No electable site found: recvd 1 of 1 votes from 2 sites
Thank you in advance.
The client is unelectable during its sync process because it is likely to be in an inconsistent state. You are correct that you should wait for the master to restart and if the master cannot restart, you need to work from a previous backup. It would be better to use a backup from the master, but if you don't have this, you could use a backup from the client.
I am assuming from your previous entry that you are using Replication Manager and BDB 5.1.19.
You can find out about the failed elections by handling the DB_EVENT_REP_ELECTION_FAILED event at either site.
You can determine when your client sync is finished by handling the DB_EVENT_REP_STARTUPDONE event at the client.
If you don't already have an event handler, you can find out more about creating an event handler in the documentation for DB_ENV->set_event_notify().
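As a rough sketch of how the two events could be combined on the client, something like the following might work. Only DB_ENV->set_event_notify(), the app_private field and the DB_EVENT_* constants come from BDB; the sync_status struct, the enums, note_event(), decide(), install_handler() and the max_failed threshold are names made up for this illustration. The BDB glue is guarded by USE_BDB so the bookkeeping part compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application-side bookkeeping; none of these names are BDB's. */
enum rep_note { NOTE_STARTUPDONE, NOTE_ELECTION_FAILED, NOTE_NEWMASTER };
enum next_step { STEP_NONE, STEP_WAIT_FOR_MASTER, STEP_RESTORE_BACKUP };

struct sync_status {
    bool sync_done;        /* true once client sync has completed */
    int failed_elections;  /* failed elections since a master was last seen */
};

/* Record one replication notification. */
void note_event(struct sync_status *st, enum rep_note note)
{
    assert(st != NULL);
    switch (note) {
    case NOTE_STARTUPDONE:     st->sync_done = true; break;
    case NOTE_ELECTION_FAILED: st->failed_elections++; break;
    case NOTE_NEWMASTER:       st->failed_elections = 0; break; /* master is back */
    }
}

/* Application policy (an assumption, not BDB behaviour): keep waiting for the
 * master, and only suggest a restore when the client never finished its sync
 * and max_failed elections have failed in a row. */
enum next_step decide(const struct sync_status *st, int max_failed)
{
    if (st->failed_elections == 0)
        return STEP_NONE;
    if (!st->sync_done && st->failed_elections >= max_failed)
        return STEP_RESTORE_BACKUP;
    return STEP_WAIT_FOR_MASTER;
}

#ifdef USE_BDB  /* build with -DUSE_BDB where db.h is available */
#include <db.h>

/* Event callback with the signature expected by DB_ENV->set_event_notify().
 * BDB may call it from its own threads, so a real application would need to
 * protect sync_status with a mutex or atomics; that is omitted here. */
static void on_event(DB_ENV *dbenv, u_int32_t event, void *event_info)
{
    struct sync_status *st = (struct sync_status *)dbenv->app_private;
    (void)event_info;
    switch (event) {
    case DB_EVENT_REP_STARTUPDONE:     note_event(st, NOTE_STARTUPDONE); break;
    case DB_EVENT_REP_ELECTION_FAILED: note_event(st, NOTE_ELECTION_FAILED); break;
    case DB_EVENT_REP_NEWMASTER:       note_event(st, NOTE_NEWMASTER); break;
    default: break;
    }
}

/* Attach the status struct and register the callback; returns 0 on success. */
int install_handler(DB_ENV *dbenv, struct sync_status *st)
{
    dbenv->app_private = st;
    return dbenv->set_event_notify(dbenv, on_event);
}
#endif
```

With this in place, the application would call install_handler() on its environment handle after db_env_create() and check decide() from its own code, for example decide(&status, 3). The threshold of 3 is arbitrary; how long to wait for the master before restoring from a backup is a decision for the application.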
I'm not sure we can talk about a split-brain in this case. Usually a split-brain means there were two masters. In this case, your master is gone and your client is unelectable because it was in client sync. The client never became a master, so there was only one master. Of course, the client in its unelectable state is not much help to you.
I'm not sure I fully understand your question. If you are asking if we can "undo" the client sync process and get back to the point on the client before it started, I think the answer is no. During our client sync process, we remove and add data on the client to make it consistent with the master. If the master goes away in the middle of this process, it is undefined whether what is left on the client is consistent enough with itself or any previous master to undo.
If you can't bring back your master and your client was in the middle of client sync, I think your only alternative is to restore the client from a backup.