One thing I noticed is that I am not setting the following on any node:
and I wonder if that might explain this behaviour.
On a slightly different note, suppose I have N (generally 1-3) replication sites running DB 4.8, and I shut them all down and install new code which runs DB 5.2, is there anything other than:
that I need to specify in my code? I don't care about doing a live upgrade; I want to shut down all sites and bring them up simultaneously. Also, is it OK to leave that line in the code when in fact I am merely restarting a process that has always used DB 5.2?
First, regarding DB_LEGACY: yes, it is okay to leave that code in there after the upgrade. It will be ignored when it is not needed, so you do not need to change your code.
When using DB_LEGACY you do not need the DB_GROUP_CREATOR setting.
However, the one thing that isn't clear from your description is whether you're adding a DB_LEGACY call for all other sites in the group. For example, if you have 3 sites, say s1, s2, and s3, they all need:
// Must set local site for each site too.
So you need to indicate a full site list on all sites for DB_LEGACY.
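To make the "full site list on all sites" rule concrete, here is a minimal sketch of the bookkeeping only. It does not call Berkeley DB; SiteAddr, SitePlan and plan_legacy_group are hypothetical names invented for illustration, and the comments note which DbSite::set_config() flag each field would presumably map to in real code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one replication site (not a Berkeley DB type).
struct SiteAddr {
    std::string host;
    unsigned port;
};

// Hypothetical per-site plan: which DbSite::set_config() flags to turn on.
struct SitePlan {
    SiteAddr addr;
    bool legacy;      // would map to dbsite->set_config(DB_LEGACY, 1)
    bool local_site;  // would map to dbsite->set_config(DB_LOCAL_SITE, 1)
};

// Build the plan for the process that runs as all_sites[local_index].
// Every site in the group, including the local one, is marked legacy;
// exactly one entry is marked as the local site.
std::vector<SitePlan> plan_legacy_group(const std::vector<SiteAddr>& all_sites,
                                        std::size_t local_index) {
    assert(local_index < all_sites.size());
    std::vector<SitePlan> plan;
    plan.reserve(all_sites.size());
    for (std::size_t i = 0; i < all_sites.size(); ++i) {
        plan.push_back(SitePlan{all_sites[i], true, i == local_index});
    }
    return plan;
}
```

With sites s1, s2 and s3, the process on s2 would call plan_legacy_group(sites, 1) and then create one DbSite per entry, so each of the three processes ends up describing all three sites as legacy and itself as local.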
I am setting DB_LEGACY for all sites. I loop over the list of sites, create a dbsite object and do this:
int eid;
// ... (lines omitted in the original post)
DbSite *dbsite;
m_env.repmgr_site(e.host.c_str(), e.port, &dbsite, 0);
dbsite->set_config(DB_BOOTSTRAP_HELPER, 1);
dbsite->set_config(DB_LEGACY, 1);
// ... (lines omitted in the original post)
m_replication_info.replica_map[eid] = e.host;
Here is a full code listing: https://github.com/hypertable/hypertable/blob/master/src/cc/Hyperspace/BerkeleyDbFilesystem.cc . The relevant code is in the constructor BerkeleyDbFilesystem::BerkeleyDbFilesystem(...), starting on line 74. The same code is run on all the replication sites simultaneously.
Is there some other logging flag I should enable to get more debug info?
A couple of small questions first. What priority do you set for your two client sites? Do you set a non-zero priority for at least one of them (priority is relevant to sending acknowledgements)?
The second question: if you stagger your sites, adding them one at a time, does this problem go away, even when you add the 3rd site to the group a bit later (where "a bit later" might be defined as after the 2nd site gets the STARTUPDONE event)?
These answers will help pinpoint where the problem might be.