> I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), where we have one master at each time. The master gets a virtual IP and that's the IP the client knows...

When you say virtual IP, do you mean some host string other than 10.10.8.5, 10.10.8.7 or 10.10.8.8? If so, this could be the problem. You must refer to each site in the replication group in a consistent manner. We have no way to relate more than one host string and port to a single site.
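To illustrate the point about host strings, here is a small standalone C sketch. It is not Berkeley DB code and does not use its API: the `site_table` structure, the `find_site`, `add_site` and `demo_site_count` functions, and the virtual IP 10.10.8.100 are all made up for the example. Port 2000 is taken from the log line for 10.10.8.8 and is only assumed for the other two replicas. The sketch assumes a site is identified purely by its exact host string and port, which is how the restriction above reads.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SITES 8

/* Hypothetical site record: identity is the exact (host string, port) pair. */
struct site {
    char host[64];
    unsigned port;
};

struct site_table {
    struct site sites[MAX_SITES];
    int count;
};

/* Return the index of the site whose host string and port match exactly, or -1. */
int find_site(const struct site_table *t, const char *host, unsigned port)
{
    for (int i = 0; i < t->count; i++)
        if (t->sites[i].port == port && strcmp(t->sites[i].host, host) == 0)
            return i;
    return -1;
}

/* Return the index of the existing or newly added site, or -1 if the table is full. */
int add_site(struct site_table *t, const char *host, unsigned port)
{
    int i = find_site(t, host, port);
    if (i >= 0)
        return i;
    if (t->count == MAX_SITES)
        return -1;
    snprintf(t->sites[t->count].host, sizeof t->sites[t->count].host, "%s", host);
    t->sites[t->count].port = port;
    return t->count++;
}

/* Register the three replicas, then the same master under a made-up virtual IP. */
int demo_site_count(void)
{
    struct site_table t = { .count = 0 };

    add_site(&t, "10.10.8.5", 2000);
    add_site(&t, "10.10.8.7", 2000);
    add_site(&t, "10.10.8.8", 2000);

    /* 10.10.8.100 is an assumed virtual IP; nothing ties it to 10.10.8.8. */
    int vip = add_site(&t, "10.10.8.100", 2000);
    assert(vip == 3);   /* it lands in a new slot instead of matching a replica */

    return t.count;     /* 4 entries for a group that really has 3 sites */
}
```

With exact matching like this, the virtual IP ends up as a fourth entry rather than an alias for the current master, so a group of three sites appears to have four members.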
> The failover works only when the elected master is the group creator node. If this node fails, then another one is elected as new master and works perfectly. In other cases (when the master is not the group creator) when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election procedure for a new master is being held.

So was this your sequence of events? If it was different, please elaborate.
> [1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )

It looks like you have turned on verbose output, but what you display here appears to come from more than one site.
> (*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
> [1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive