This discussion is archived
1 Reply Latest reply: Jan 30, 2012 1:18 PM by Paula B

DB_EVENT_REP_CONNECT_BROKEN event to survived replicas !

807695 Newbie
Hello all,
I am working on a high-availability project with the latest Berkeley DB (5.3). I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), with one master at any time. The master gets a virtual IP, and that is the IP the client knows. Failover works only when the elected master is the group creator node: if that node fails, another one is elected as the new master and everything works perfectly. In the other cases (when the master is not the group creator), when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election for a new master is held. As a result, the client cannot communicate with them and gets "connection refused". The exact output (with many debug flags on) is:

[1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )
(*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
[1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive


My DB_CONFIG files are:

10.10.8.5
set_flags DB_TXN_NOSYNC on
set_flags DB_AUTO_COMMIT on
repmgr_site 10.10.8.7 5010
repmgr_site 10.10.8.8 5010
repmgr_site 10.10.8.5 5010 db_local_site on db_group_creator on
rep_set_priority 100

10.10.8.7
set_flags DB_TXN_NOSYNC on
set_flags DB_AUTO_COMMIT on
repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
repmgr_site 10.10.8.7 5010 db_local_site on
repmgr_site 10.10.8.8 5010
rep_set_priority 100

10.10.8.8
set_flags DB_TXN_NOSYNC on
set_flags DB_AUTO_COMMIT on
repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
repmgr_site 10.10.8.7 5010
repmgr_site 10.10.8.8 5010 db_local_site on
rep_set_priority 100


Any ideas about this error?

Many thanks,
Dimos.
  • 1. Re: DB_EVENT_REP_CONNECT_BROKEN event to survived replicas !
    Paula B Explorer
    We need more information to understand the problem.
    I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), where we have one master at each time. The master gets a virtual IP and that's the IP the client knows...
    When you say virtual IP, do you mean some host string other than 10.10.8.5, 10.10.8.7 or 10.10.8.8? If so, this could be the problem. You must refer to each site in the replication group in a consistent manner. We have no way to relate more than one host string and port to a single site.
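The consistency requirement above can be checked mechanically. The following is only a minimal sketch, not part of Berkeley DB: it assumes the `repmgr_site` lines have the form `repmgr_site HOST PORT [flag on|off ...]` as in the posted files, and the names `CONFIGS`, `parse_sites` and `check_consistency` are hypothetical helpers introduced for illustration. It reports any site that is not named with the same host string and port in every file, and any file that does not mark exactly one local site.

```python
# The repmgr_site lines from the three DB_CONFIG files posted in this thread,
# keyed by the host each file belongs to (the keying is an assumption).
CONFIGS = {
    "10.10.8.5": """\
repmgr_site 10.10.8.7 5010
repmgr_site 10.10.8.8 5010
repmgr_site 10.10.8.5 5010 db_local_site on db_group_creator on
""",
    "10.10.8.7": """\
repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
repmgr_site 10.10.8.7 5010 db_local_site on
repmgr_site 10.10.8.8 5010
""",
    "10.10.8.8": """\
repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
repmgr_site 10.10.8.7 5010
repmgr_site 10.10.8.8 5010 db_local_site on
""",
}


def parse_sites(text):
    """Return a list of (host, port, flags) tuples, one per repmgr_site line."""
    sites = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].lower() != "repmgr_site":
            continue
        host, port = fields[1], int(fields[2])
        rest = fields[3:]
        # Remaining fields are read as "name on|off" pairs.
        flags = {
            rest[i].lower(): rest[i + 1].lower() == "on"
            for i in range(0, len(rest) - 1, 2)
        }
        sites.append((host, port, flags))
    return sites


def check_consistency(configs):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = {}
    for owner, text in configs.items():
        sites = parse_sites(text)
        seen[owner] = {(host, port) for host, port, _ in sites}
        local = [(h, p) for h, p, f in sites if f.get("db_local_site")]
        if len(local) != 1:
            problems.append(
                f"{owner}: expected exactly one db_local_site, found {len(local)}"
            )
        elif local[0][0] != owner:
            problems.append(f"{owner}: local site is {local[0][0]}, not {owner}")
    # Every file should name exactly the same set of (host, port) pairs.
    all_sites = set().union(*seen.values())
    for owner, sites in seen.items():
        for host, port in sorted(all_sites - sites):
            problems.append(f"{owner}: does not list {host}:{port}")
    return problems


if __name__ == "__main__":
    for problem in check_consistency(CONFIGS):
        print(problem)
```

A file that names a site by a virtual IP while the others use its real address would appear here as a site missing from the other files. The sketch only compares the configuration files; it cannot see host strings passed to the API at run time.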
    The failover works only when the elected master is the group creator node. If this node fails, then another one is elected as new master and works perfectly. In other cases (when the master is not the group creator) when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election procedure for a new master is being held.
    So was this your sequence of events? If it was different, please elaborate.
    1. Start up replication with site 5 master, sites 7 and 8 clients.
    2. Kill site 5, site 8 became master.
    3. Site 5 rejoins the replication group as a client.
    4. Kill site 8 (current master), sites 5 and 7 get CONNECT_BROKEN but don't start an election.
    [1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )
    (*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
    [1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive
    It looks like you have turned on verbose output, but what you display here appears to come from more than one site.

    The first line talks about a broken connection to site 8 and you say site 8 is the master that just crashed. This implies that this line is from one of the clients (site 5 or site 7).

    The final line ("bust connection") is a message that should only be coming from a site that thinks it's the current master. I assume this must be from site 8.

    The only other explanation is that you are using more than one host string to refer to the same site, as I mentioned above, which we don't support.
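When verbose output from several sites ends up in one place, it can help to split each line into its fields before deciding which site produced it. The sketch below is an illustration only. It assumes that the two bracketed prefixes in the quoted lines are `[seconds:microseconds]` since the Unix epoch and `[process id/thread id]`; the names `LOG`, `parse_line` and `group_by_process` are hypothetical.

```python
import re
from datetime import datetime, timezone

# The two Berkeley DB lines quoted in this thread, without the poster's annotation.
LOG = """\
[1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000
[1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive
"""

# Assumed layout: [secs:usecs][pid/tid] message
LINE_RE = re.compile(r"^\[(\d+):(\d+)\]\[(\d+)/(\d+)\]\s*(.*)$")


def parse_line(line):
    """Split one verbose line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    secs, usecs, pid, tid, message = match.groups()
    when = datetime.fromtimestamp(int(secs), tz=timezone.utc)
    when = when.replace(microsecond=int(usecs))
    return {"time": when, "pid": int(pid), "tid": int(tid), "message": message}


def group_by_process(text):
    """Group the parsed lines by process id, preserving their order."""
    groups = {}
    for line in text.splitlines():
        record = parse_line(line)
        if record is not None:
            groups.setdefault(record["pid"], []).append(record)
    return groups


if __name__ == "__main__":
    for pid, records in group_by_process(LOG).items():
        print(f"process {pid}:")
        for record in records:
            print(f"  {record['time'].isoformat()}  {record['message']}")
```

A process id is only meaningful on one host, so when logs from several machines are merged, the host name has to be recorded alongside each line for this grouping to be useful.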

    I have one other question - are you using rep_set_config() to turn off the DB_REPMGR_CONF_ELECTIONS flag at any time? I realize this is unlikely, but it's worth ruling this out.
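Besides a rep_set_config() call in the application, the flag could also be changed from a DB_CONFIG file. The following sketch assumes that DB_CONFIG accepts a line of the form `rep_set_config DB_REPMGR_CONF_ELECTIONS on|off`, matched case-insensitively here; `elections_setting` and `POSTED` are hypothetical names. It only inspects configuration text and cannot detect calls made in application code.

```python
# One of the DB_CONFIG files posted in this thread (site 10.10.8.7).
POSTED = """\
set_flags DB_TXN_NOSYNC on
set_flags DB_AUTO_COMMIT on
repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
repmgr_site 10.10.8.7 5010 db_local_site on
repmgr_site 10.10.8.8 5010
rep_set_priority 100
"""


def elections_setting(config_text):
    """Return True/False for the last explicit setting, or None if never set.

    A rep_set_config line without a trailing on/off is treated as "on",
    which is an assumption of this sketch.
    """
    setting = None
    for line in config_text.splitlines():
        fields = line.split()
        if (
            len(fields) >= 2
            and fields[0].lower() == "rep_set_config"
            and fields[1].lower() == "db_repmgr_conf_elections"
        ):
            setting = not (len(fields) >= 3 and fields[2].lower() == "off")
    return setting


if __name__ == "__main__":
    result = elections_setting(POSTED)
    if result is None:
        print("DB_CONFIG does not set DB_REPMGR_CONF_ELECTIONS")
    else:
        print("elections", "enabled" if result else "disabled", "in DB_CONFIG")
```

For the posted files the function returns None, so if elections are being turned off, it would have to happen in the application itself.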

    Paula Bingham
    Oracle
