File 0xf32b67: <summary totalCount="33820" totalSize="9999858" totalINCount="2890" totalINSize="6656942" totalLNCount="7348" totalLNSize="1435119" obsoleteINCount="2312" obsoleteLNCount="16" obsoleteLNSize="399" obsoleteLNSizeCounted="7"/>
On a slight tangent and more out of curiosity than anything, why does totalCount (see above) not equal the sum of the other counts?

The total includes entries that aren't Btree nodes, such as commits and BINDeltas. It is likely that lots of BINDeltas are being written by the checkpoint, but the DbPrintLog -S output would tell us more.
Also, in the EnvironmentStats, how are random writes possible in an append-only txn log?

Due to a previous read. We estimate what is random vs sequential to try to estimate when the disk head would have moved. A random read or write is considered one that changes the file position by 1MB.
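For what it's worth, here is a minimal sketch of that heuristic as described above (illustrative only, not JE's actual code; the 1MB threshold and the idea of comparing against the previous file position are taken straight from the explanation):

// Illustrative sketch: an IO counts as random when it moves the file
// position by more than ~1MB relative to the previous IO.
public class IoClassifierSketch {
    private static final long RANDOM_THRESHOLD = 1L << 20; // ~1MB
    private long lastPosition = -1;

    /** Returns true if an IO at the given file position would count as random. */
    public boolean isRandom(long position) {
        boolean random = lastPosition >= 0
            && Math.abs(position - lastPosition) > RANDOM_THRESHOLD;
        lastPosition = position;
        return random;
    }
}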
On node 1 we're seeing 20GB of random writes per hour compared with 3.7GB on node 2 - at that rate the entire 50GB DB gets rewritten every couple of hours... I just recounted the digits to be absolutely sure since that figure seems insane!

You said earlier "node 1 is creating/deleting 10MB log files every 1.4 seconds", so that would be around 50 GB in two hours. So at least the numbers match. The issue is that checkpointing and eviction are writing, and they shouldn't be.
greybird wrote: I won't be able to obtain the last few log files for sure, but I'll push harder to get them to run DbPrintLog. I'll send the other info I have in a separate email to you.
So far I haven't been able to come up with a possible cause for this. Even if they were to spike at some point in time, normally eviction and checkpointing should both decrease over time with such a low application write rate, and should not be high in general. It would help to have the DbPrintLog -S output, if you can get it, or even better the last several log files. The DbSpace output isn't enough to know what types of metadata are being written. Please also send the stats diff over email.
Duh, of course, silly me! For some reason my brain hadn't made the obvious connection that reads use the same disk head as writes. For reduced durability settings, no-sync and write-no-sync (we use the latter by default), I assume that random writes are less of a concern since writes will be buffered and effectively grouped together?
Yes, and in fact it's a little worse than I first estimated. It's writing 5GB/hr sequentially as well.
I assume that random writes are less of a concern since writes will be buffered and effectively grouped together?

The stats you're looking at are the actual file system writes, not just additions to the log buffer. But in general, yes, using NO_SYNC durability along with a large log buffer size will reduce the amount of random IO.
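In case it's useful, here is a minimal sketch of that combination (write-no-sync durability plus a larger log buffer). The parameter name is from the EnvironmentConfig javadoc as I recall it, and the 20 MB buffer size is purely an illustrative value, not a recommendation:

import java.io.File;

import com.sleepycat.je.Durability;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class NoSyncConfigSketch {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);

        // Reduced durability: commits are written to the file system but not fsynced.
        envConfig.setDurability(Durability.COMMIT_WRITE_NO_SYNC);

        // Larger total log buffer so more writes are grouped before reaching the disk.
        // 20 MB is an illustrative value only.
        envConfig.setConfigParam(EnvironmentConfig.LOG_TOTAL_BUFFER_BYTES,
                                 String.valueOf(20 * 1024 * 1024));

        Environment env = new Environment(new File(args[0]), envConfig);
        // ... normal use of the environment ...
        env.close();
    }
}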
To speed things up and reduce the risk of interfering too much with the production systems, would it be a good compromise to get the print log for a subset of the log files, e.g. the last 100?

Sure, the output for the last 100 files would be plenty.
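For reference, the invocation would be something along these lines (flag names as I remember them from the DbPrintLog usage message, so double-check against your je.jar; substitute the first and last file numbers of the range you want):

java -cp je.jar com.sleepycat.je.util.DbPrintLog -h <env home dir> -s <first file #> -e <last file #> -S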
greybird wrote: That's right, I had read about lazy migration already :-) and double-checked that we had it disabled (we had, via the high priority checkpointer).
I took a close look at the JE stats you sent. Thanks for creating the easy-to-read spreadsheet!
I'd like to confirm this by looking at the DbPrintLog -S output, but what I believe is happening is the result of the interaction of log cleaning and checkpointing, due primarily (I think) to the lazy migration setting (EnvironmentConfig.CLEANER_LAZY_MIGRATION) which is true by default and not overridden by your app. As you probably know, lazy migration is strongly recommended to be turned off and is off by default in JE 5. This has been discussed in several threads on the forum, but let me know if you need more info on that.
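For anyone following along, here is a minimal sketch of turning it off explicitly (parameter names as I recall them from EnvironmentConfig; the cleaner thread count of 2 is only an illustrative value, per the suggestion further down to add cleaner threads to compensate):

import com.sleepycat.je.EnvironmentConfig;

public class CleanerConfigSketch {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();

        // Turn off lazy migration so the checkpointer and evictor don't migrate
        // records on behalf of the cleaner.
        envConfig.setConfigParam(EnvironmentConfig.CLEANER_LAZY_MIGRATION, "false");

        // Give the cleaner more threads to compensate (illustrative value only).
        envConfig.setConfigParam(EnvironmentConfig.CLEANER_THREADS, "2");

        // The "high priority checkpointer" setting mentioned above.
        envConfig.setConfigParam(EnvironmentConfig.CHECKPOINTER_HIGH_PRIORITY, "true");
    }
}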
The stats don't show anything that is particularly abnormal other than the excessive cleaning and checkpointing activity. Other stats correlate with that. There isn't anything I can see that smacks of a bug, e.g., a cache leak.

That's interesting because we have default settings for checkpointing, 20MB/30s IIRC, and neither is being adhered to by the looks of things! Node 1 is checkpointing once per minute and node 2 once every 2 minutes. So this looks like the Checkpointer has too much work to do and cannot complete in time or within budget (I'll avoid any obvious jokes about EU governments at this point!).
What stands out to me is the size of the checkpoints, based on EnvironmentLastCheckpointStart and EnvironmentLastCheckpointEnd. On node 1 (the misbehaving one) the two checkpoint sizes are around 350 MB, and even on node 2 (the good one) they are around 180 or 200 MB. These are very long checkpoints. Since cleaned file deletion is not done until the end of a checkpoint, there is a long lag between cleaning a file and deleting it.
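For reference, a minimal sketch of pulling those two stats programmatically (getter names as I recall them from EnvironmentStats; the values are LSNs, so treating their difference as a byte count is only approximate since a checkpoint can span multiple 10MB log files):

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentStats;
import com.sleepycat.je.StatsConfig;

public class CheckpointStatsSketch {
    /** Prints the LSNs of the last checkpoint's start and end records. */
    static void printLastCheckpointSpan(Environment env) {
        EnvironmentStats stats = env.getStats(new StatsConfig());
        System.out.println("last checkpoint start LSN: 0x"
            + Long.toHexString(stats.getLastCheckpointStart()));
        System.out.println("last checkpoint end LSN:   0x"
            + Long.toHexString(stats.getLastCheckpointEnd()));
    }
}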
What I suspect is that something happened to these older environments -- perhaps database deletions as you mentioned -- that caused a spike in log cleaning. But due to the lazy migration setting and the resulting long checkpoints, cleaning was not able to make fast enough progress to catch up. The cleaner and checkpointer have gotten into a cycle that produces waste as fast as, or faster than, it can be cleaned.

I don't think the possible DB deletions can be the cause: node 2 is slowly going the same way as node 1 and it has never experienced DB deletions or even upgrades.
This is obviously a guess, but because we have seen bad behavior like this before with lazy migration enabled, I think it's the most likely cause.
Since you're hesitant to upgrade to JE 5 (which would have the most benefit), you may want to first try disabling lazy migration and also increasing the number of cleaner threads to compensate. I would expect to see a noticeable improvement.

I agree about JE 5, but this is a production system and I don't think that the user is willing to make major changes without understanding why. Our next version of OpenDJ does use JE 5 and tests have shown significant improvements. We haven't released it yet (another reason why the user is unwilling to upgrade!).