We need more information to understand the problem.
> I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), where we have one master at each time. The master gets a virtual IP and that's the IP the client knows...
When you say virtual IP, do you mean some host string other than 10.10.8.5, 10.10.8.7 or 10.10.8.8? If so, this could be the problem. You must refer to each site in the replication group in a consistent manner. We have no way to relate more than one host string and port to a single site.
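To make the point about host strings concrete, here is a minimal sketch in C of what "consistent" means in practice. It is an illustration only, not Berkeley DB code: the `site_addr` type and `same_site()` function are hypothetical names, and the literal comparison is an assumption that models the limitation described above (no lookup or alias resolution between host strings).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical type for illustration only; not part of the Berkeley DB API. */
struct site_addr {
    const char *host;   /* host string exactly as given to the replication manager */
    unsigned int port;
};

/*
 * Two addresses count as the same site only when the host string and the
 * port both match literally. No DNS lookup or alias resolution is attempted,
 * so a virtual IP and a real IP on the same machine compare as different.
 */
bool same_site(const struct site_addr *a, const struct site_addr *b)
{
    assert(a != NULL && b != NULL);
    return a->port == b->port && strcmp(a->host, b->host) == 0;
}
```

Under this comparison, `{"10.10.8.5", 2000}` and a virtual address such as `{"10.10.8.100", 2000}` (a made-up value) are two different sites, even if both reach the same machine. That is why every site should be named by one host string and port everywhere it is configured.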
> The failover works only when the elected master is the group creator node. If this node fails, then another one is elected as new master and works perfectly. In other cases (when the master is not the group creator) when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election procedure for a new master is being held.
So was this your sequence of events? If it was different, please elaborate.
1. Start up replication with site 5 master, sites 7 and 8 clients.
2. Kill site 5; site 8 becomes master.
3. Site 5 rejoins the replication group as a client.
4. Kill site 8 (current master); sites 5 and 7 get CONNECT_BROKEN but don't start an election.
> [1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )
> (*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
> [1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive
It looks like you have turned on verbose output. But it looks like what you display here is from more than one site.
The first line talks about a broken connection to site 8 and you say site 8 is the master that just crashed. This implies that this line is from one of the clients (site 5 or site 7).
The final line ("bust connection") is a message that should only be coming from a site that thinks it's the current master. I assume this must be from site 8.
The only other explanation is that you are using more than one host string to refer to the same site, as I mentioned above, which we don't support.
I have one other question: are you using rep_set_config() to turn off the DB_REPMGR_CONF_ELECTIONS flag at any time? I realize this is unlikely, but it's worth ruling this out.