This content has been marked as final. Show 3 replies
The log message means that it is taking a significant amount of time to commit write operations on the master node in the JE HA replication group. The log to rep stream time is the amount of time between when the commit log record was created, to the time it was written to the network. From the user perspective, these two are done atomically. Calling Transaction.commit() results in first writing a commit log record to the master node's log, and then JE HA tries to send that record to the appropriate replica nodes.
The threshold for deciding that this is an unexpectedly long time is controlled by the unadvertised parameter, je.rep.transferLoggingThreshold, which defaults to 5 seconds, whereas your message seems to be saying that it is taking 1.8 seconds. Perhaps you have set the logging threshold to a shorter period? At any rate, it does indicate that there is a slowdown in the system, and the message tries to give you some pointers as to what is taking a long time.
I've not found any reference about "je.rep.transferLoggingThreshold" in JavaDoc or source code (4.1.17). Perhaps in 4.1.17 has a different name and default value?
Anyway I understand your explanation but, could you give me more info about the "Total transfer time"?
This logging threshold is unadvertised; it seemed too specialized to make it a publicly documented knob. You can find its implementation in com.sleepycat.je.rep.impl.RepParams.TRANSFER_LOGGING_THRESHOLD. In JE 4.1, the default is still 5 seconds, so I'm not sure why it is printing a lag less than five seconds ... but from code inspection, I think it is a still a reliable indicator that the master is either bogged down in terms of CPU, or that the network is not accepting its attempts to write a log record into the replication stream.
The Total transfer time is the aggregated time to write all commits to the replication stream, for the life of a single network connection between a master node and its replica. For example, if a master sends 1000 transaction commits to a replica during a single network connection, this is the time that it took to write these 1000 commits to the network. If the master's connection to the replica is broken and restarted, this value would be reset to 0 when the master next establishes a connection.
Your question yesterday about this trace message caused us to have a short internal debate ourselves about the usefulness of this stat. It was put in because we found it useful at one point for a specific performance tuning and debugging exercise. It's a bit hard to understand on its own; to be useful, you need to have a pretty good idea of how many items have been transmitted in the replication stream, when master/replica connections might be severed and reconnected, etc.
So in summary, the presence of that trace message means that the master is running slowly or that the network is backed up, and that commits are taking a long time to make it to the network.