1 Reply Latest reply: Jan 30, 2012 3:18 PM by Paula B-Oracle

    DB_EVENT_REP_CONNECT_BROKEN event to surviving replicas!

    807695
      Hello all,
      I am working on a high-availability project with the latest Berkeley DB 5.3. I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), with one master at any given time. The master gets a virtual IP, and that is the IP the client knows. Failover works only when the elected master is the group creator node: if that node fails, another one is elected as the new master and everything works perfectly. In the other cases (when the master is not the group creator), when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election for a new master is held. The client therefore cannot communicate with them and gets "connection refused". The exact output (with many debug flags on) is:

      [1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )
      (*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
      [1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive
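      For context, the event code in the debug line above comes from an event-notification callback. A minimal sketch of such a callback, assuming the standard Berkeley DB C API (the function name here is hypothetical; the DB_EVENT_* constants are real), looks like:

```c
#include <stdio.h>
#include <db.h>

/* Sketch of a callback registered with dbenv->set_event_notify().
 * It distinguishes a mere broken connection from the events that
 * signal an actual role change after an election. */
static void
replica_event_callback(DB_ENV *dbenv, u_int32_t event, void *event_info)
{
    switch (event) {
    case DB_EVENT_REP_CONNECT_BROKEN:
        /* A connection to another site dropped; repmgr retries it.
         * By itself this does not mean the master is gone. */
        printf("connection to a site broke\n");
        break;
    case DB_EVENT_REP_MASTER:
        /* With repmgr, this fires on the site that became master. */
        printf("this site is now master\n");
        break;
    case DB_EVENT_REP_CLIENT:
        printf("this site is now a client\n");
        break;
    default:
        printf("unhandled event %u\n", (unsigned)event);
        break;
    }
}

/* Registration, done once on the environment handle:
 *   dbenv->set_event_notify(dbenv, replica_event_callback);
 */
```

      After a master failure, a surviving client that wins the election should see DB_EVENT_REP_MASTER rather than only DB_EVENT_REP_CONNECT_BROKEN.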


      My DB_CONFIG files are:

      10.10.8.5
      set_flags DB_TXN_NOSYNC on
      set_flags DB_AUTO_COMMIT on
      repmgr_site 10.10.8.7 5010
      repmgr_site 10.10.8.8 5010
      repmgr_site 10.10.8.5 5010 db_local_site on db_group_creator on
      rep_set_priority 100

      10.10.8.7
      set_flags DB_TXN_NOSYNC on
      set_flags DB_AUTO_COMMIT on
      repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
      repmgr_site 10.10.8.7 5010 db_local_site on
      repmgr_site 10.10.8.8 5010
      rep_set_priority 100

      10.10.8.8
      set_flags DB_TXN_NOSYNC on
      set_flags DB_AUTO_COMMIT on
      repmgr_site 10.10.8.5 5010 db_bootstrap_helper on
      repmgr_site 10.10.8.7 5010
      repmgr_site 10.10.8.8 5010 db_local_site on
      rep_set_priority 100
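      (For comparison, the DB_CONFIG lines for 10.10.8.5 can also be expressed through the C API's DB_SITE interface, available since Berkeley DB 5.2. This is only a sketch with error handling omitted; it assumes the same hosts and port as above:)

```c
#include <db.h>

/* Sketch: programmatic equivalent of the 10.10.8.5 DB_CONFIG file,
 * using the DB_SITE handle returned by DB_ENV->repmgr_site(). */
static void
configure_sites(DB_ENV *dbenv)
{
    DB_SITE *site;

    /* Local site, which also created the replication group. */
    dbenv->repmgr_site(dbenv, "10.10.8.5", 5010, &site, 0);
    site->set_config(site, DB_LOCAL_SITE, 1);
    site->set_config(site, DB_GROUP_CREATOR, 1);
    site->close(site);

    /* Remote sites, referred to by exactly the same host strings
     * that the other replicas use. */
    dbenv->repmgr_site(dbenv, "10.10.8.7", 5010, &site, 0);
    site->close(site);
    dbenv->repmgr_site(dbenv, "10.10.8.8", 5010, &site, 0);
    site->close(site);

    dbenv->rep_set_priority(dbenv, 100);
}
```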


      Any ideas about this error?

      Many thanks,
      Dimos.
        • 1. Re: DB_EVENT_REP_CONNECT_BROKEN event to surviving replicas!
          Paula B-Oracle
          We need more information to understand the problem.
          I have 3 replicas (10.10.8.5, 10.10.8.7, and 10.10.8.8), where we have one master at each time. The master gets a virtual IP and that's the IP the client knows...
          When you say virtual IP, do you mean some host string other than 10.10.8.5, 10.10.8.7 or 10.10.8.8? If so, this could be the problem. You must refer to each site in the replication group in a consistent manner. We have no way to relate more than one host string and port to a single site.
          The failover works only when the elected master is the group creator node. If this node fails, then another one is elected as new master and works perfectly. In other cases (when the master is not the group creator) when I kill the master, the other two replicas receive a DB_EVENT_REP_CONNECT_BROKEN event and no election procedure for a new master is being held.
          So was this your sequence of events? If it was different, please elaborate.
          1. Start up replication with site 5 master, sites 7 and 8 clients.
          2. Kill site 5, site 8 became master.
          3. Site 5 rejoins the replication group as a client.
          4. Kill site 8 (current master), sites 5 and 7 get CONNECT_BROKEN but don't start an election.
          [1327795212:363390][20270/1149090128] TROVE:DBPF:Berkeley DB: EOF on connection to site 10.10.8.8:2000 ( 10.10.8.8 is the master node that just crashed )
          (*) Debugging : BerkeleyDB_replica_state_callback : default : 4 // This is the code for DB_EVENT_REP_CONNECT_BROKEN
          [1327795212:363620][20270/1149090128] TROVE:DBPF:Berkeley DB: Repmgr: bust connection. Block archive
          It looks like you have turned on verbose output, but what you display here appears to be from more than one site.

          The first line talks about a broken connection to site 8 and you say site 8 is the master that just crashed. This implies that this line is from one of the clients (site 5 or site 7).

          The final line ("bust connection") is a message that should only be coming from a site that thinks it's the current master. I assume this must be from site 8.

          The only other explanation is that you are using more than one host string to refer to the same site, as I mentioned above, which we don't support.

          I have one other question - are you using rep_set_config() to turn off the DB_REPMGR_CONF_ELECTIONS flag at any time? I realize this is unlikely, but it's worth ruling this out.
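          (To rule that out, a short sketch using the C API; `check_elections` is a hypothetical helper name and error handling is omitted:)

```c
#include <stdio.h>
#include <db.h>

/* Sketch: verify that automatic repmgr elections are enabled.
 * DB_REPMGR_CONF_ELECTIONS is on by default; an explicit
 * rep_set_config() call somewhere could have turned it off. */
static void
check_elections(DB_ENV *dbenv)
{
    int onoff;

    dbenv->rep_get_config(dbenv, DB_REPMGR_CONF_ELECTIONS, &onoff);
    if (!onoff) {
        printf("automatic elections were disabled; re-enabling\n");
        dbenv->rep_set_config(dbenv, DB_REPMGR_CONF_ELECTIONS, 1);
    }
}
```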

          Paula Bingham
          Oracle