What version of Berkeley DB are you using? Are you using Replication Manager or the replication Base API?
Are all of your clients (slaves) on different machines from each other and from the master?
Is there anything different about client5 (your 5th client) or the machine it is on? Are communications set up the same?
One thing you can investigate is to see if the machine/environment you were using for client5 still operates slowly if you make it your 4th client instead. To do this, you would leave your original client4 out of your replication group and only start up client1, client2, client3 and client5 and then see if client5 is still much slower than the other clients. This will help us isolate whether this is an issue specific to client5 or an issue concerning load on the master. Let us know the results of this experiment.
I'm using Replication Manager with Berkeley DB 5.2.
The clients are on different machines from each other and from the master. They all use the same config, except for the remote site list. There are 6 machines on the LAN with 100 Mbps links.
Also, all clients know about each other at startup.
When I start all 5 clients one by one, they are not slow.
I have tried several times to start all 5 clients at the same time. client5 is not always the slow one; sometimes it is client4 or client3.
I used gdb and found that ACKs are timing out. Should I use client-to-client sync? How do I do that?
By the way, I found that data.db is sometimes cleaned out to 0KB and the sync of all db data restarts. Is that OK?
Your reply does seem to indicate that the master is overloaded. We have several additional questions for you. I'll list them first, and then provide more explanations below.
Have you explicitly set an ack policy, and if so which one?
How do you handle log cleanup, particularly on the master?
Are you seeing the slow client when starting up the replication group for the first time or when starting up a replication group that already existed? If it is the first startup, did you specify the master as the DB_GROUP_CREATOR and then connect the other sites with DB_BOOTSTRAP_HELPER? Is every site connected to all other sites as a DB_BOOTSTRAP_HELPER?
In your first post, you said the slow client was 10 bytes per second. What was it doing when you measured that? Was it still syncing with the master? Or had its sync completed so that it was processing log records from the master?
You said that sometimes data.db is cleaned out to 0KB and sync is restarted. Can you tell us more about when you notice this happening? Had the client already completed a previous sync? Had there been any other changes to the replication group or a change of master? Had you recently cleaned up or archived log files on the master?
And just to clarify, when you see a slow client, do the other clients continue to run at the expected speed, or do the other clients slow down at that point too?
Here are some explanations:
Your ack policy affects how likely you are to see ack timeouts. It also can affect how long it takes a replication group to start up initially because internal txns to add sites to the replication group also use the ack policy.
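As a rough sketch of where these settings live, assuming the environment is configured through a DB_CONFIG file rather than in code: the ack policy and the ack timeout can each be set with one line. The policy and the 2-second timeout shown here are illustrative values only, not recommendations for your group.

```
# DB_CONFIG in the environment home directory (illustrative values)
# Ack policy the master uses when waiting for clients to acknowledge.
repmgr_set_ack_policy DB_REPMGR_ACKS_QUORUM
# How long the master waits for acks, in microseconds (2 seconds here).
rep_set_timeout DB_REP_ACK_TIMEOUT 2000000
```

The same settings are available in code through DB_ENV->repmgr_set_ack_policy() and DB_ENV->rep_set_timeout().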
There are a few possible explanations for data.db getting cleaned out and having the sync restart. One is that the master has changed and that client needs to sync with a new master. Another is that you have removed master log files that the slow client still needs. One way to make sure you keep all the log files needed by all the clients in your replication group is to use DB_LOG_AUTO_REMOVE (see DB_ENV->log_set_config()).
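If the environment reads a DB_CONFIG file, automatic log removal can be turned on with a single line. This is only a sketch; it assumes you are not also deleting log files by hand or from a script, which would defeat the purpose.

```
# DB_CONFIG on the master (and on any site that may become master)
# Let Berkeley DB remove log files it no longer needs.
log_set_config DB_LOG_AUTO_REMOVE on
```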
You asked about c2c (client-to-client). This is a strategy for reducing master load when many clients are syncing at the same time. You set it by specifying the DB_REPMGR_PEER flag to DB_SITE->set_config(). You need to plan your startup sequence carefully, though. A client that is going to serve other clients as a peer needs to have completed its sync before other clients can sync from it.
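A minimal sketch of the peer designation in a client's DB_CONFIG file, assuming sites are declared there with repmgr_site lines. The host names and ports are hypothetical; substitute the ones from your own remote site list.

```
# DB_CONFIG on client5 (hypothetical hosts and ports)
repmgr_site client5.example.com 5005 db_local_site on
# Prefer client1 as the site to sync from instead of the master.
repmgr_site client1.example.com 5001 db_repmgr_peer on
```

In code, the equivalent is to get a DB_SITE handle for client1 with DB_ENV->repmgr_site() and then call DB_SITE->set_config() on it with DB_REPMGR_PEER turned on.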