This discussion is archived
5 Replies Latest reply: Oct 24, 2011 12:12 PM by 524761

Replication master fails to get permanent message acknowledgements

674194 Newbie
Hi,

I am using Berkeley DB 5.2.36 (C++ API), and for some reason when I use 3 replication sites, the master fails to get enough acks for permanent messages. It works fine with 2 replication sites, and I did not see this problem when running with Berkeley DB 4.8. I have set the ack policy to DB_REPMGR_ACKS_QUORUM, and both DB_REP_ACK_TIMEOUT and DB_REP_LEASE_TIMEOUT to 30s. Here is some of my code:

u_int32_t env_flags =
DB_CREATE | // If the environment does not exist, create it
DB_INIT_LOCK | // Initialize locking
DB_INIT_LOG | // Initialize logging
DB_INIT_MPOOL | // Initialize the cache
DB_INIT_TXN | // Initialize transactions
DB_INIT_REP | // Initialize replication
DB_RECOVER | // Do basic recovery
DB_THREAD;

if (force_recover)
env_flags |= DB_RECOVER_FATAL;

m_db_flags = DB_CREATE | DB_AUTO_COMMIT | DB_THREAD;
m_env.set_lk_detect(DB_LOCK_DEFAULT);
m_env.set_app_private(&m_replication_info);
m_env.set_event_notify(db_event_callback);
m_env.set_msgcall(db_msg_callback);
m_env.set_errcall(db_err_callback);
m_env.set_verbose(DB_VERB_REPLICATION, 1);

// open dbenv
m_env.open(m_base_dir.c_str(), env_flags, 0);

m_env.rep_set_config(DB_REP_CONF_LEASE, 1);
m_env.rep_set_config(DB_REP_CONF_BULK, 1);
m_env.rep_set_config(DB_REPMGR_CONF_2SITE_STRICT, 1);
m_env.repmgr_set_ack_policy(DB_REPMGR_ACKS_QUORUM);
m_env.rep_set_timeout(DB_REP_HEARTBEAT_SEND, 30000000); //30s
m_env.rep_set_timeout(DB_REP_HEARTBEAT_MONITOR, 60000000); //60s
m_env.rep_set_timeout(DB_REP_ACK_TIMEOUT, 30000000); //30s
m_env.rep_set_timeout(DB_REP_CONNECTION_RETRY, 5000000); //5s
m_env.rep_set_timeout(DB_REP_LEASE_TIMEOUT, 30000000); //30s

// loop over replication sites
foreach (replication_site) {
    if (is_localhost) {
        m_env.rep_set_priority(priority);
        DbSite *dbsite;
        m_env.repmgr_site(e.host.c_str(), e.port, &dbsite, 0);
        dbsite->set_config(DB_LOCAL_SITE, 1);
        dbsite->set_config(DB_LEGACY, 1);
        dbsite->close();
    } else {
        int eid;
        DbSite *dbsite;
        m_env.repmgr_site(e.host.c_str(), e.port, &dbsite, 0);
        dbsite->set_config(DB_BOOTSTRAP_HELPER, 1);
        dbsite->set_config(DB_LEGACY, 1);
        dbsite->get_eid(&eid);
        dbsite->close();
    }
    --priority;
}
m_env.repmgr_start(3, DB_REP_ELECTION);
// block on a cond var till election is complete
m_replication_info.wait_for_initial_election();
if (is_master()) {
// do checkpoint
m_env.txn_checkpoint(0, 0, 0);
...
}

I believe the checkpoint operation triggers a permanent message, which then fails to get enough acks.

Is there some change between v4.8 and v5.2 that explains this behaviour?

Looking at the logs on the master I see:

[1318981276:851356][17503/1455565120] CLIENT: New master gen 1, egen 2
[1318981276:851546][17503/1434585408] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 0 eid 1, type vote2, LSN [0][0] nogroup
[1318981276:863074][17503/1455565120] MASTER: rep_start: Old log version was 18
[1318981276:863193][17503/1455565120] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type newmaster, LSN [1][28] nobuf
[1318981276:863354][17503/1455565120] MASTER: restore_prep: No prepares. Skip.
[1318981276:891241][17503/1466054976] MASTER: Ended election with 0, e_th 0, egen 2, flag 0x6924, e_fl 0x0, lo_fl 0x11
[1318981276:891356][17503/1466054976] MASTER: election thread is exiting
[1318981276:900353][17503/1455565120] MASTER: bulk_msg: Send buffer after copy due to PERM
[1318981276:900432][17503/1455565120] MASTER: send_bulk: Send 5476 (0x1564) bulk buffer bytes
[1318981276:900466][17503/1455565120] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type bulk_log, LSN [1][5308] flush lease perm
[1318981276:910895][17503/47758366616912] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type start_sync, LSN [1][5524] nobuf
[1318981276:921446][17503/1445075264] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
[1318981276:922340][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
[1318981276:923250][17503/1445075264] MASTER: Walk_dir: Getting info for dir: /opt/hypertable/sanjit/0.9.5.1/hyperspace
[1318981276:923510][17503/1445075264] MASTER: Walk_dir: Dir /opt/hypertable/sanjit/0.9.5.1/hyperspace has 10 files
[1318981276:923548][17503/1445075264] MASTER: Walk_dir: File 0 name: __db.002
[1318981276:923579][17503/1445075264] MASTER: Walk_dir: File 1 name: __db.rep.egen
[1318981276:923609][17503/1445075264] MASTER: Walk_dir: File 2 name: __db.rep.diag01
[1318981276:923638][17503/1445075264] MASTER: Walk_dir: File 3 name: lock
[1318981276:923856][17503/1445075264] MASTER: Walk_dir: File 3 lock: returned error Invalid argument
[1318981276:923911][17503/1445075264] MASTER: Walk_dir: File 4 name: __db.001
[1318981276:923941][17503/1445075264] MASTER: Walk_dir: File 5 name: __db.rep.system
[1318981276:931106][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
[1318981276:931652][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
[1318981306:912131][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
[1318981306:912294][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
[1318981306:933466][17503/47758366616912] MASTER: bulk_msg: Send buffer after copy due to PERM
[1318981306:933581][17503/47758366616912] MASTER: send_bulk: Send 3139 (0xc43) bulk buffer bytes
[1318981306:933619][17503/47758366616912] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid -1, type bulk_log, LSN [1][8459] flush lease perm
[1318981306:933786][17503/47758366616912] MASTER: will await acknowledgement: need 1
[1318981306:933829][17503/47758366616912] MASTER: checking perm result, 0, 0, 1
[1318981336:935208][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
[1318981336:935432][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]
[1318981366:935353][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 1, type update_req, LSN [0][0]
[1318981366:935528][17503/1434585408] MASTER: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 2, type update_req, LSN [0][0]

At this point the master receives a DB_EVENT_REP_PERM_FAILED event.

On one of the clients I see:

[1318981277:192521][27559/1512315200] CLIENT: election thread is exiting
[1318981306:911962][27559/1459865920] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid 0, type update_req, LSN [0][0] nobuf
[1318981306:934234][27559/1449376064] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_process_message: msgv = 6 logv 18 gen = 1 eid 0, type bulk_log, LSN [1][8459] flush lease perm
[1318981336:934973][27559/1438886208] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid 0, type update_req, LSN [0][0] nobuf
[1318981366:935125][27559/1459865920] CLIENT: /opt/hypertable/sanjit/0.9.5.1/hyperspace rep_send_message: msgv = 6 logv 18 gen = 1 eid 0, type update_req, LSN [0][0] nobuf
[1318981366:986131][27559/1480845632] CLIENT: EOF on connection to site test01:38041
[1318981366:986457][27559/1470355776] CLIENT: starting election thread
[1318981366:986609][27559/1470355776] CLIENT: Start election nsites 3, ack 2, priority 2
[1318981366:986657][27559/1470355776] CLIENT: wait_time: grant_expire 0 0 lease_to 60000000
[1318981366:986702][27559/1470355776] CLIENT: Election thread owns egen 2
[1318981366:986748][27559/1470355776] CLIENT: Setting priority 0, unelectable, due to internal init/recovery
  • 1. Re: Replication master fails to get permanent message acknowledgements
    674194 Newbie
    One thing I noticed is that I am not setting the following on any node:
    dbsite->set_config(DB_GROUP_CREATOR, 1);

    and I wonder if that might explain this behaviour.

    On a slightly different note, suppose I have N (generally 1-3) replication sites running DB 4.8, and I shut them all down and install new code which runs DB 5.2, is there anything other than:

    dbsite->set_config(DB_LEGACY, 1);

    that I need to specify in my code? I don't care about doing a live upgrade; I want to shut down all sites and bring them up simultaneously. Also, is it OK to leave that line in the code when in fact I am merely restarting a process that has always used DB 5.2?
  • 2. Re: Replication master fails to get permanent message acknowledgements
    524722 Explorer
    First, when using DB_LEGACY: yes, it is okay to leave that code in there after the upgrade. It will be ignored when it is not needed, so you do not need to change your code.

    When using DB_LEGACY you do not need the DB_GROUP_CREATOR setting.

    However, the one thing that isn't clear from your description is whether you are making a DB_LEGACY call for all of the other sites in the group. For example, if you have 3 sites, say s1, s2, s3, they all need:

    // Must set local site for each site too.
    s1_dbsite->set_config(DB_LEGACY, 1);
    s2_dbsite->set_config(DB_LEGACY, 1);
    s3_dbsite->set_config(DB_LEGACY, 1);

    So you need to indicate the full site list on all sites for DB_LEGACY.

    Sue LoVerso
    Oracle
  • 3. Re: Replication master fails to get permanent message acknowledgements
    674194 Newbie
    Hi Sue,

    I am setting DB_LEGACY for all sites. I loop over the list of sites, create a DbSite object, and do this:

    int eid;
    DbSite *dbsite;
    m_env.repmgr_site(e.host.c_str(), e.port, &dbsite, 0);
    dbsite->set_config(DB_BOOTSTRAP_HELPER, 1);
    dbsite->set_config(DB_LEGACY, 1);
    dbsite->get_eid(&eid);
    dbsite->close();
    m_replication_info.replica_map[eid] = e.host;

    Here is a full code listing: https://github.com/hypertable/hypertable/blob/master/src/cc/Hyperspace/BerkeleyDbFilesystem.cc. The relevant code is in the constructor BerkeleyDbFilesystem::BerkeleyDbFilesystem(...), starting on line 74. The same code is run on all the replication sites simultaneously.

    Is there some other logging flag I should enable to get more debug info?
  • 4. Re: Replication master fails to get permanent message acknowledgements
    524722 Explorer
    A couple small questions first. What priority do you set for your two client sites? Do you set some non-zero priority for at least one of them (priority is relevant to sending acknowledgements)?

    The second question: if you stagger your sites, adding them one at a time, does this problem go away, even when you add the 3rd site to the group a bit later (where "a bit later" might be defined as after the 2nd site gets the STARTUPDONE event)?

    These answers will help pinpoint where the problem might be.

    Sue LoVerso
    Oracle
  • 5. Re: Replication master fails to get permanent message acknowledgements
    524761 Journeyer
    When using DB_LEGACY, you also have to make sure that each site in the group specifies the network addresses of every other site in the group (using DbEnv::repmgr_site()).
