5 Replies Latest reply: Oct 24, 2011 2:12 PM by 524761

    Replication master fails to get permanent message acknowledgements

    674194
      Hi,

      I am using BerkeleyDB 5.2.36 (C++ API), and for some reason when I use 3 replication sites, the master fails to get enough acks for permanent messages. It seems to work fine with 2 replication sites, and I didn't see this problem when running with BerkeleyDB 4.8. I have set the ack policy to DB_REPMGR_ACKS_QUORUM, and both DB_REP_ACK_TIMEOUT and DB_REP_LEASE_TIMEOUT to 30s. Here is some of my code:

      u_int32_t env_flags =
      DB_CREATE | // If the environment does not exist, create it
      DB_INIT_LOCK | // Initialize locking
      DB_INIT_LOG | // Initialize logging
      DB_INIT_MPOOL | // Initialize the cache
      DB_INIT_TXN | // Initialize transactions
      DB_INIT_REP | // Initialize replication
      DB_RECOVER | // Do basic recovery
      DB_THREAD;

      if (force_recover)
      env_flags |= DB_RECOVER_FATAL;

      m_db_flags = DB_CREATE | DB_AUTO_COMMIT | DB_THREAD;
      m_env.set_lk_detect(DB_LOCK_DEFAULT);
      m_env.set_app_private(&m_replication_info);
      m_env.set_event_notify(db_event_callback);
      m_env.set_msgcall(db_msg_callback);
      m_env.set_errcall(db_err_callback);
      m_env.set_verbose(DB_VERB_REPLICATION, 1);

      // open dbenv
      m_env.open(m_base_dir.c_str(), env_flags, 0);

      m_env.rep_set_config(DB_REP_CONF_LEASE, 1);
      m_env.rep_set_config(DB_REP_CONF_BULK, 1);
      m_env.rep_set_config(DB_REPMGR_CONF_2SITE_STRICT, 1);
      m_env.repmgr_set_ack_policy(DB_REPMGR_ACKS_QUORUM);
      m_env.rep_set_timeout(DB_REP_HEARTBEAT_SEND, 30000000); //30s
      m_env.rep_set_timeout(DB_REP_HEARTBEAT_MONITOR, 60000000); //60s
      m_env.rep_set_timeout(DB_REP_ACK_TIMEOUT, 30000000); //30s
      m_env.rep_set_timeout(DB_REP_CONNECTION_RETRY, 5000000); //5s
      m_env.rep_set_timeout(DB_REP_LEASE_TIMEOUT, 30000000); //30s

      //loop over replication sites
      foreach (replication_site) {
      if (is_localhost) {
      m_env.rep_set_priority(priority);
      DbSite *dbsite;
      m_env.repmgr_site(e.host.c_str(), e.port, &dbsite, 0);
      dbsite->set_config(DB_LOCAL_SITE, 1);
      dbsite->set_config(DB_LEGACY, 1);
      dbsite->close();
      }
      else {
      int eid;
      DbSite *dbsite;
      m_env.repmgr_site(e.host.c_str(), e.port, &dbsite, 0);
      dbsite->set_config(DB_BOOTSTRAP_HELPER, 1);
      dbsite->set_config(DB_LEGACY, 1);
      dbsite->get_eid(&eid);
      dbsite->close();
      }
      --priority;
      }
      m_env.repmgr_start(3, DB_REP_ELECTION);
      // block on a cond var till election is complete
      m_replication_info.wait_for_initial_election();
      if (is_master()) {
      // do checkpoint
      m_env.txn_checkpoint(0, 0, 0);
      ...
      }

      I believe the checkpoint operation generates a permanent message, and it is this message that fails to get enough acks.

      Is there some change between v4.8 and v5.2 that explains this behaviour?

      Looking at the logs on the master I see:

      [1318981276:851356][17503/1455565120] CLIENT: New master gen 1, egen 2
      [1318981276:851546][17503/1434585408] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 0 eid 1, type vote2, LSN [0][0] nogroup
      [1318981276:863074][17503/1455565120] MASTER: rep_start: Old log version was 18
      [1318981276:863193][17503/1455565120] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type newmaster, LSN [1][28] nobuf
      [1318981276:863354][17503/1455565120] MASTER: restore_prep: No prepares. Skip.
      [1318981276:891241][17503/1466054976] MASTER: Ended election with 0, e_th 0, egen 2, flag 0x6924, e_fl 0x0, lo_fl 0x11
      [1318981276:891356][17503/1466054976] MASTER: election thread is exiting
      [1318981276:900353][17503/1455565120] MASTER: bulk_msg: Send buffer after copy due to PERM
      [1318981276:900432][17503/1455565120] MASTER: send_bulk: Send 5476 (0x1564) bulk buffer bytes
      [1318981276:900466][17503/1455565120] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type bulk_log, LSN [1][5308] flush lease perm
      [1318981276:910895][17503/47758366616912] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type start_sync, LSN [1][5524] nobuf
      [1318981276:921446][17503/1445075264] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
      [1318981276:922340][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
      [1318981276:923250][17503/1445075264] MASTER: Walk_dir: Getting info for dir: /opt/hypertable/sanjit/0.9.5.1/hyperspace
      [1318981276:923510][17503/1445075264] MASTER: Walk_dir: Dir /opt/hypertable/sanjit/0.9.5.1/hyperspace has 10 files
      [1318981276:923548][17503/1445075264] MASTER: Walk_dir: File 0 name: __db.002
      [1318981276:923579][17503/1445075264] MASTER: Walk_dir: File 1 name: __db.rep.egen
      [1318981276:923609][17503/1445075264] MASTER: Walk_dir: File 2 name: __db.rep.diag01
      [1318981276:923638][17503/1445075264] MASTER: Walk_dir: File 3 name: lock
      [1318981276:923856][17503/1445075264] MASTER: Walk_dir: File 3 lock: returned error Invalid argument
      [1318981276:923911][17503/1445075264] MASTER: Walk_dir: File 4 name: __db.001
      [1318981276:923941][17503/1445075264] MASTER: Walk_dir: File 5 name: __db.rep.system
      [1318981276:931106][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
      [1318981276:931652][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
      [1318981306:912131][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
      [1318981306:912294][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
      [1318981306:933466][17503/47758366616912] MASTER: bulk_msg: Send buffer after copy due to PERM
      [1318981306:933581][17503/47758366616912] MASTER: send_bulk: Send 3139 (0xc43) bulk buffer bytes
      [1318981306:933619][17503/47758366616912] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type bulk_log, LSN [1][8459] flush lease perm
      [1318981306:933786][17503/47758366616912] MASTER: will await acknowledgement: need 1
      [1318981306:933829][17503/47758366616912] MASTER: checking perm result, 0, 0, 1
      [1318981336:935208][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
      [1318981336:935432][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
      [1318981366:935353][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
      [1318981366:935528][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]

      At this point the master receives a DB_EVENT_REP_PERM_FAILED message.

      On one of the clients I see:

      [1318981277:192521][27559/1512315200] CLIENT: election thread is exiting
      [1318981306:911962][27559/1459865920] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid 0, type update_req, LSN [0][0] nobuf
      [1318981306:934234][27559/1449376064] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 0, type bulk_log, LSN [1][8459] flush lease perm
      [1318981336:934973][27559/1438886208] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid 0, type update_req, LSN [0][0] nobuf
      [1318981366:935125][27559/1459865920] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid 0, type update_req, LSN [0][0] nobuf
      [1318981366:986131][27559/1480845632] CLIENT: EOF on connection to site test01:38041
      [1318981366:986457][27559/1470355776] CLIENT: starting election thread
      [1318981366:986609][27559/1470355776] CLIENT: Start election nsites 3, ack 2, priority 2
      [1318981366:986657][27559/1470355776] CLIENT: wait_time: grant_expire 0 0 lease_to 60000000
      [1318981366:986702][27559/1470355776] CLIENT: Election thread owns egen 2
      [1318981366:986748][27559/1470355776] CLIENT: Setting priority 0, unelectable, due to internal init/recovery
        • 1. Re: Replication master fails to get permanent message acknowledgements
          674194
          One thing I noticed is that I am not setting the following on any node:
          dbsite->set_config(DB_GROUP_CREATOR, 1);

          and I wonder if that might explain this behaviour.

          On a slightly different note, suppose I have N (generally 1-3) replication sites running DB 4.8. If I shut them all down and install new code that runs DB 5.2, is there anything other than:

          dbsite->set_config(DB_LEGACY, 1);

          that I need to specify in my code? I don't care about doing a live upgrade; I want to shut down all sites and bring them up simultaneously. Also, is it OK to leave that line in the code when in fact I am merely restarting a process that has always used DB 5.2?
          • 2. Re: Replication master fails to get permanent message acknowledgements
            524722
            First, when using DB_LEGACY: yes, it is okay to leave that code in there after the upgrade. It will be ignored when it is not needed, so you do not need to change your code.

            When using DB_LEGACY you do not need the DB_GROUP_CREATOR setting.

            However, the one thing that isn't clear from your description is whether you're adding in a call for DB_LEGACY for all other sites in the group. For example, if you have 3 sites, say, s1, s2, s3 they all need:

            // Must set local site for each site too.
            s1_dbsite->set_config(DB_LEGACY, 1);
            s2_dbsite->set_config(DB_LEGACY, 1);
            s3_dbsite->set_config(DB_LEGACY, 1);

            So you need to indicate a full site list on all sites for DB_LEGACY.

            Sue LoVerso
            Oracle
            • 3. Re: Replication master fails to get permanent message acknowledgements
              674194
              Hi Sue,

              I am setting DB_LEGACY for all sites. I loop over the list of sites, create a dbsite object and do this:

               int eid;
               DbSite *dbsite;
               m_env.repmgr_site(e.host.c_str(), e.port, &dbsite, 0);
               dbsite->set_config(DB_BOOTSTRAP_HELPER, 1);
               dbsite->set_config(DB_LEGACY, 1);
               dbsite->get_eid(&eid);
               dbsite->close();
               m_replication_info.replica_map[eid] = e.host;

              Here is a full code listing https://github.com/hypertable/hypertable/blob/master/src/cc/Hyperspace/BerkeleyDbFilesystem.cc, the relevant code is in the constructor BerkeleyDbFilesystem::BerkeleyDbFilesystem(...) starting on line 74. The same code is run on all the replication sites simultaneously.

              Is there some other logging flag I should enable to get more debug info?
              • 4. Re: Replication master fails to get permanent message acknowledgements
                524722
                A couple small questions first. What priority do you set for your two client sites? Do you set some non-zero priority for at least one of them (priority is relevant to sending acknowledgements)?

                The second question: if you stagger your startup and add your sites one at a time, does this problem go away, even when you add the 3rd site to the group a bit later (where "a bit later" might be defined as after the 2nd site gets the STARTUPDONE event)?

                These answers will help pinpoint where the problem might be.

                Sue LoVerso
                Oracle
                • 5. Re: Replication master fails to get permanent message acknowledgements
                  524761
                  When using DB_LEGACY, you also have to make sure that each site in the group specifies the network addresses of every other site in the group (using DbEnv::repmgr_site()).
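
                   As an illustration (not code from the original poster's tree; the SiteAddr struct, configure_legacy_group name, and the idea of driving it from a per-node site table are made up, and error handling is omitted), each node might enumerate the full group like this:

```cpp
#include <db_cxx.h>
#include <string>
#include <vector>

// Hypothetical per-node view of the group; every node's table must
// list every member, with exactly one entry flagged as local.
struct SiteAddr { std::string host; u_int port; bool is_local; };

void configure_legacy_group(DbEnv &env, const std::vector<SiteAddr> &sites) {
    for (const SiteAddr &s : sites) {
        DbSite *dbsite;
        env.repmgr_site(s.host.c_str(), s.port, &dbsite, 0);
        dbsite->set_config(DB_LEGACY, 1);         // set on every site, on every node
        if (s.is_local)
            dbsite->set_config(DB_LOCAL_SITE, 1); // exactly one local site per node
        dbsite->close();
    }
}
```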