I am not very familiar with the internals of checkpoints or how they do their locking. But I am aware of some known performance considerations when using checkpoints with replication.
Are you using Replication Manager? If so, what is your ack_policy and ack_timeout? I'll assume you are using Replication Manager in my answer below.
I am using version 4.7.25 in a replicated (1 master, 1 slave) environment. I have a main thread that reads network requests and does db->get() or db->put() as appropriate. I have a second thread that calls txn_checkpoint every minute. During a checkpoint, my put() performance falls off immediately, with some puts taking over 16 seconds to complete (the checkpoint runs for about 50 seconds).

When you do a checkpoint on the master, it causes a checkpoint to happen on the client as well. For most ack_policies, the master needs to wait for the client's checkpoint to consider its own checkpoint durable.
Is this expected? Does checkpointing lock the environment, database, or many pages at once, or is it all caused by heavy disk IO?
It is possible that, due to hardware or workload differences, the client's checkpoint may take longer than the master's. We provide the DB_REP_CHECKPOINT_DELAY timeout to account for this difference. You set it with DB_ENV->rep_set_timeout(), and its default is 30 seconds. If you are sure your client checkpoints will finish sooner than that, you can decrease this timeout value.
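If you are confident the client checkpoint finishes well under the default, the delay can be lowered through DB_ENV->rep_set_timeout() or in the environment's DB_CONFIG file. A sketch, assuming a 10-second target (the value is in microseconds; check the DB_CONFIG spelling against your release's documentation):

```text
# DB_CONFIG: lower the checkpoint delay from the 30-second
# default to 10 seconds (value is in microseconds)
rep_set_timeout DB_REP_CHECKPOINT_DELAY 10000000
```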
A client definitely cannot apply new master transactions while it is doing its checkpoint. This means any new master transactions must wait until the client checkpoint is finished before they can be considered durable. This may explain some of your performance issue.
The environment has the following flags set: DB_AUTO_COMMIT | DB_TXN_NOSYNC, and is opened with DB_CREATE | DB_INIT_LOCK | DB_INIT_LOG | DB_INIT_MPOOL | DB_INIT_TXN | DB_RECOVER | DB_THREAD | DB_INIT_REP.

You are using DB_TXN_NOSYNC, which means transactions are not flushed to disk when they are committed. Depending on the size of your workload and your log buffers, you could be building up considerable activity that needs to be done during your checkpoint.
My primary database has no flags set, my secondary has DB_DUPSORT. My cache is large compared to the data in the database.
One thing worth mentioning is that the machine only has 1 disk, so my databases and log records are both sharing it.
Would setting DB_MULTIVERSION and using snapshot isolation improve throughput during checkpoints at all?

We do not support replication with transaction snapshots. This is on our list of future enhancements to consider.
The checkpoint operation goes into the cache, looks for modified pages that are appropriate candidates for writing back to disk, and writes them. The good news is that this creates space in the cache for new pages and reduces recovery time. The bad news is that it causes a flood of IO and blocks other activity while it is taking place. If there was very little update activity between checkpoint A and checkpoint B, the effect of the checkpoint can go by unnoticed. If there was heavy activity, the effect will be very noticeable. The tradeoff of not doing checkpoints is longer recovery time, and when you do run into a case where cache space is needed, the transaction that needs it is the one affected, and other transactions will caravan behind it. It is also worth mentioning that by using a NOSYNC variant you are pushing more work into the checkpoint. A way to look at this is as a set of balancing points; each and every application is different. I would suggest adjusting checkpoint frequency, turning NOSYNC on and off, and so on to find what works best for your application.
I am using the replication manager, my ack policy is DB_REPMGR_ACKS_ONE, and I have not adjusted the default timeout. My slave is really just a backup should the master fail in some way. It receives no user traffic whilst it is acting as a slave, so my questions really only relate to processing performance on the master node.
I will certainly try not setting TXN_NOSYNC.
Regarding the DB_REP_CHECKPOINT_DELAY that the master waits in the checkpointing code: txn_checkpoint only seems to hold the checkpoint mutex for this duration, so am I correct in assuming that it has no effect on my program's ability to update records whilst it is waiting? The master and slave are both running on identical hardware.
Regarding the DB_REP_CHECKPOINT_DELAY that the master waits in the checkpointing code; txn_checkpoint only seems to hold the checkpoint mutex for this duration so am I correct in assuming that it has no effect on my program's ability to update records whilst it is waiting?

You are correct that other updates on the master are not blocked during the DB_REP_CHECKPOINT_DELAY portion of the master checkpoint.
But, if you do a master update while the actual client checkpoint is still going on during this delay, the client cannot acknowledge that master update until the client checkpoint is complete. So if the client checkpoint takes longer than DB_REP_ACK_TIMEOUT, the master update would not be considered durable and would generate a PERM_FAILED event. If your application handles PERM_FAILED events in a way that delays things, this could be contributing to your master performance delays. Even if your application does not handle PERM_FAILED events, any master updates that occur during the actual client checkpoint will wait the entire DB_REP_ACK_TIMEOUT (by default 1 second) for an ack.
I mentioned two different timeouts above: DB_REP_CHECKPOINT_DELAY and DB_REP_ACK_TIMEOUT.
For DB_REP_CHECKPOINT_DELAY, the default value of 30 seconds is quite long, but as Mike mentioned, it is very dependent on your application. If you build up lots of changes between checkpoints, you may still need all that time. Use of DB_TXN_NOSYNC further increases the amount of processing done by a checkpoint. The actual checkpoint time on the client is the major factor to determine the length of this timeout. The amount of time for a message round trip also contributes, but is probably much smaller, particularly with your sites on the same LAN.
For DB_REP_ACK_TIMEOUT, you said you are using the default timeout of 1 second. This timeout also needs to factor in the amount of time for a message round trip (commit log record from master to client, then ack from client to master) and the round trip is proportionately a much larger part of this timeout. If you can expect much faster message round trip times consistently, you can lower this. If you start seeing many PERM_FAILED events, that would be an indication that you lowered it too much. If your application doesn't handle the PERM_FAILED event, you can use Replication Manager statistics to monitor this.
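If you do decide to lower it, the same rep_set_timeout mechanism applies. A sketch via the environment's DB_CONFIG file, assuming a 250-millisecond target (illustrative; the value is in microseconds):

```text
# DB_CONFIG: lower the ack timeout from the 1-second default to
# 250 milliseconds (value is in microseconds)
rep_set_timeout DB_REP_ACK_TIMEOUT 250000
```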