9 Replies Latest reply: Dec 6, 2012 10:12 AM by user4547579

Large Checkpointer behavioral disparity between two "identical" DB envs

user4547579 Newbie
Hi there,

We have a large, mature production environment consisting of 6 nodes, each with a JE 4.1.20 environment of around 50G (or 25G of live data at 50% utilization). The nodes are kept in sync using our own replication protocol and contain the same application data. One of the nodes - node 2 - was taken out of service and reinitialized over 3 months ago. Since then it has grown in size until it reached the minimum utilization threshold (50%), where it has remained ever since. The other nodes (1 and 3-6) have been around much longer, probably over a year, and have been at the threshold for most of that time. The older nodes may have gone through a couple of JE 4.1.x upgrades, as well as perhaps some database deletions (when unwanted indexes were removed), although I cannot confirm that for sure.

We have observed that each of the nodes becomes busier and busier over time (higher load averages) even though the throughput remains about the same (and pretty low, to be honest - about 3.5 small transactions per second). The databases are growing steadily as a result of more users being added, so you might think it was simply a question of the DB cache being too small (although I strongly doubt that in this case). However, this does not explain the disparity between the newer node and the older nodes: even though they have all been at the log utilization threshold for quite some time, the Checkpointer thread on the older nodes is constantly busy at around 50-100% CPU, while on node 2 it is almost completely inactive. We suspect that over time node 2 will suffer a similar fate, but we have no idea why.

We have obtained various stats regarding two environments (1 & 2) for comparison and we can see:

* both nodes have the exact same configuration - about 25G of live data, 6G DB cache, and default JE settings except the following (see the configuration sketch after this list):

je.checkpointer.highPriority=true
je.log.fileCacheSize=6000 (there are ~5000x10M log files in the environment)
je.cleaner.lookAheadCacheSize=2M
je.log.iteratorReadSize=2M
je.log.faultReadSize=4K

* node 2 has a smaller transaction log - 47G vs 50G - but claims to have the same utilization according to DbSpace - 49%. Could the b-tree in node 1 somehow be fragmented, or carry more admin data overhead?

* the nodes have a very different distribution of log utilization (I can provide data + graphs if needed). Of particular interest, DbSpace for node 1 shows that the last 40 or so log files all have very low utilization (0-40%), which seems very strange to me (I can provide full DbSpace output if you want), especially compared to node 2, whose last log files are all in the 70-100% range as you'd expect

* node 1 is constantly checkpointing and looks like it is writing and rewriting administrative data and BIN nodes

* node 1's cleaners are pretty quiet - I'm seeing one cleaner running at between 5-15% CPU at any point in time, but that's all: nothing compared to the Checkpointer. Node 2 is comparatively silent - almost no cleaning or checkpointing taking place (or when it does, it happens so quickly that it doesn't show up as CPU load)

* node 1 is creating/deleting 10MB log files every 1.4 seconds, compared with one every 7.2 seconds for node 2, for the same application write load
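
For reference, here's roughly how those non-default settings could be applied when opening the environment (they could equally live in a je.properties file). This is only a minimal sketch to show the settings in context - the class name, environment path and cache size are illustrative placeholders, not our production code:

    import java.io.File;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;

    public class EnvSetupSketch {
        public static void main(String[] args) {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setTransactional(true);
            envConfig.setAllowCreate(true);
            envConfig.setCacheSize(6L * 1024 * 1024 * 1024);   // ~6G DB cache

            // Non-default settings listed above (explicit byte values used
            // instead of the 2M/4K shorthand).
            envConfig.setConfigParam("je.checkpointer.highPriority", "true");
            envConfig.setConfigParam("je.log.fileCacheSize", "6000");
            envConfig.setConfigParam("je.cleaner.lookAheadCacheSize",
                                     String.valueOf(2 * 1024 * 1024));
            envConfig.setConfigParam("je.log.iteratorReadSize",
                                     String.valueOf(2 * 1024 * 1024));
            envConfig.setConfigParam("je.log.faultReadSize", "4096");

            Environment env = new Environment(new File("/path/to/env"), envConfig);
            // ... open databases, run the application ...
            env.close();
        }
    }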

Are you aware of any issues like this in JE 4.1.20? I'm hesitant to advise knee-jerk responses like increasing the DB cache size or upgrading to JE 5 without really understanding the problem, since neither "solution" really explains the disparity that we are seeing, or the excessive activity of the Checkpointer.

I have extensive stats I can provide you with if needed:

* DbSpace output for nodes 1 and 2

* environment stats taken one hour apart for both nodes, so it is possible to see the deltas for each node and compare

* graphs showing the utilization distribution for nodes 1 and 2

Many thanks :-)

Matt
  • 1. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    greybird Expert
    Hi Matt,

    I'm still thinking about this, but I have a couple questions.

    1) Is there any noticeable difference in eviction activity (in the JE eviction stats) on node 1 and 2?

    2) I assume the problem persists after bouncing the process -- correct? That would rule out a transient, accumulating error in the JE cache budget, for example. How long do these JE Environments stay open without a restart?

    3) If you run DbPrintLog -S on the two environments I assume you'll see a much larger proportion of BIN log entries in node 1 than node 2, due to the additional checkpointing. Is that the case? Anything else that's a noticeable difference in the proportion of log entry types?

    Thanks,
    --mark
  • 2. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    user4547579 Newbie
    Hi Mark,

    Thanks for the quick response. Here are my answers:

    1) Is there any noticeable difference in eviction activity (in the JE eviction stats) on node 1 and 2?

    A: Yes, node 1 has much higher eviction stats, but then it has much higher everything, really, other than client transactions (TransactionNCommits, which are the same). Here is a selection (I have a nice spreadsheet I can send if you want, which has a full set of deltas for all stats for each node during the same hour; a sketch of how such deltas can be gathered follows the list):

    Stat, Node 1 delta, Node 2 delta
    EnvironmentNEvictPasses, 35392, 6177
    EnvironmentNNodesExplicitlyEvicted, 2426950, 438850
    EnvironmentNRootNodesEvicted, 3, 0
    EnvironmentNBINsEvictedCritical, 66634, 10857
    EnvironmentNBINsEvictedDaemon, 813728, 152097
    EnvironmentNBINsEvictedEvictorThread, 1546438, 275952
    EnvironmentNUpperINsEvictedCritical, 7, 0
    EnvironmentNUpperINsEvictedDaemon, 813728, 152097
    EnvironmentNUpperINsEvictedEvictorThread, 140, 4
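
    For context, deltas like these can be captured either by diffing two cumulative snapshots or by reading the stats with clearing enabled, so each read returns the change since the previous one - a minimal sketch, with made-up class and method names:

        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentStats;
        import com.sleepycat.je.StatsConfig;

        final class StatsSnapshot {
            // Call this once an hour; with setClear(true) the counters are
            // reset after each read, so the returned stats are per-interval deltas.
            static EnvironmentStats intervalDelta(Environment env) {
                StatsConfig config = new StatsConfig();
                config.setClear(true);
                EnvironmentStats stats = env.getStats(config);
                System.out.println("nEvictPasses=" + stats.getNEvictPasses()
                    + ", nNodesExplicitlyEvicted=" + stats.getNNodesExplicitlyEvicted());
                return stats;
            }
        }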

    2) I assume the problem persists after bouncing the process -- correct? That would rule out a transient, accumulating error in the JE cache budget, for example. How long do these JE Environments stay open without a restart?

    Yes, the problem is persistent. I don't have accurate numbers for the last question, but I know that it is of the order of months, rather than days/weeks. During the lifetime of these nodes the DB has grown in size from around 120M records to double that.

    3) If you run DbPrintLog -S on the two environments I assume you'll see a much larger proportion of BIN log entries in node 1 than node 2, due to the additional checkpointing. Is that the case? Anything else that's a noticeable difference in the proportion of log entry types?

    That would be interesting to know. I'll see if I can find out, although it may be difficult as these DBs are in production. I have the DbSpace output, so can I estimate the number of INs from the XML records DbSpace produces? e.g.:

    File 0xf32b67: <summary totalCount="33820" totalSize="9999858" totalINCount="2890" totalINSize="6656942" totalLNCount="7348" totalLNSize="1435119" obsoleteINCount="2312" obsoleteLNCount="16" obsoleteLNSize="399" obsoleteLNSizeCounted="7"/>

    On a slight tangent, and more out of curiosity than anything, why does totalCount (see above) not equal the sum of the other counts? Also, in the EnvironmentStats, how are random writes possible in an append-only txn log? On node 1 we're seeing 20GB of random writes per hour compared with 3.7GB on node 2 - at that rate the entire 50GB DB gets rewritten every couple of hours... I just recounted the digits to be absolutely sure since that figure seems insane! :-)

    Matt
  • 3. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    greybird Expert
    Matt,

    So far I haven't been able to come up with a possible cause for this. Even if they were to spike at some point in time, normally eviction and checkpointing should both decrease over time with such a low application write rate, and should not be high in general. It would help to have the DbPrintLog -S output, if you can get it, or even better the last several log files. The DbSpace output isn't enough to know what types of metadata are being written. Please also send the stats diff over email.
    > File 0xf32b67: <summary totalCount="33820" totalSize="9999858" totalINCount="2890" totalINSize="6656942" totalLNCount="7348" totalLNSize="1435119" obsoleteINCount="2312" obsoleteLNCount="16" obsoleteLNSize="399" obsoleteLNSizeCounted="7"/>
    > On a slight tangent and more out of curiosity than anything, why does totalCount (see above) not equal the sum of the other counts?
    The total includes entries that aren't Btree nodes such as commits and BINDeltas. It is likely that lots of BINDeltas are being written by the checkpoint, but the DbPrintLog -S would tell us more.
    > Also, in the EnvironmentStats, how are random writes possible in an append-only txn log?
    Due to a previous read. We estimate what is random vs sequential to try to estimate when the disk head would have moved. A random read or write is considered one that changes the file position by 1MB.
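    In other words, roughly the following - a simplified sketch of the heuristic as just described, not JE's actual implementation:

        // An I/O is counted as "random" when it moves the file position by
        // more than ~1MB from where the previous I/O left off.
        final class IoClassifier {
            private static final long RANDOM_THRESHOLD = 1024L * 1024;  // ~1MB

            private long nextSequentialPosition = -1;

            boolean isRandom(long position, int size) {
                boolean random = nextSequentialPosition >= 0
                    && Math.abs(position - nextSequentialPosition) > RANDOM_THRESHOLD;
                nextSequentialPosition = position + size;  // where a sequential I/O would continue
                return random;
            }
        }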
    > On node 1 we're seeing 20GB of random writes per hour compared with 3.7GB on node 2 - at that rate the entire 50GB DB gets rewritten every couple of hours... I just recounted the digits to be absolutely sure since that figure seems insane!
    You said earlier "node 1 is creating/deleting 10MB log files every 1.4 seconds", so that would be around 50 GB in two hours. So at least the numbers match. The issue is that checkpointing and eviction are writing, and they shouldn't be.

    --mark
  • 4. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    user4547579 Newbie
    Hi Mark,

    Thanks for the response. Answers in line below:
    greybird wrote:
    > Matt,
    >
    > So far I haven't been able to come up with a possible cause for this. Even if they were to spike at some point in time, normally eviction and checkpointing should both decrease over time with such a low application write rate, and should not be high in general. It would help to have the DbPrintLog -S output, if you can get it, or even better the last several log files. The DbSpace output isn't enough to know what types of metadata are being written. Please also send the stats diff over email.
    I won't be able to obtain the last few log files for sure, but I'll push harder to get them to run DbPrintLog. I'll send the other info I have in a separate email to you.

    [...]

    >> Also, in the EnvironmentStats, how are random writes possible in an append-only txn log?
    > Due to a previous read. We estimate what is random vs sequential to try to estimate when the disk head would have moved. A random read or write is considered one that changes the file position by 1MB.
    Duh, of course, silly me! For some reason my brain hadn't made the obvious connection that reads use the same disk head as writes. For reduced durability settings, no-sync and write-no-sync (we use the latter by default), I assume that random writes are less of a concern since writes will be buffered and effectively grouped together?
    >> On node 1 we're seeing 20GB of random writes per hour compared with 3.7GB on node 2 - at that rate the entire 50GB DB gets rewritten every couple of hours... I just recounted the digits to be absolutely sure since that figure seems insane!
    > You said earlier "node 1 is creating/deleting 10MB log files every 1.4 seconds", so that would be around 50 GB in two hours. So at least the numbers match. The issue is that checkpointing and eviction are writing, and they shouldn't be.
    Yes, and in fact it's a little worse than I first estimated. It's writing 5GB/hr sequentially as well.

    Anyway, as I said, I'll mail you the more detailed stats and I'll push harder to get the DbPrintLog. To speed things up and reduce the risk of interfering too much with the production systems, would it be a good compromise to get the print log for a subset of the log files, e.g. the last 100?

    Cheers,

    Matt
  • 5. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    greybird Expert
    > I assume that random writes are less of a concern since writes will be buffered and effectively grouped together?
    The stats you're looking at are the actual file system writes, not just additions to the log buffer. But in general, yes, using NO_SYNC durability along with a large log buffer size will reduce the amount of random IO.
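    For illustration, reduced durability plus a larger log buffer budget might look like this - a sketch only, with example values rather than recommendations:

        import com.sleepycat.je.Durability;
        import com.sleepycat.je.EnvironmentConfig;

        final class DurabilitySketch {
            static EnvironmentConfig reducedDurabilityConfig() {
                EnvironmentConfig envConfig = new EnvironmentConfig();
                envConfig.setTransactional(true);
                // Commit without an fsync: data reaches the file system but is
                // not forced to disk on every commit.
                envConfig.setDurability(Durability.COMMIT_WRITE_NO_SYNC);
                // A larger total log buffer budget lets JE group writes into
                // bigger, mostly sequential file system writes (example value).
                envConfig.setConfigParam("je.log.totalBufferBytes",
                                         String.valueOf(50L * 1024 * 1024));
                return envConfig;
            }
        }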
    > To speed things up and reduce the risk of interfering too much with the production systems, would it be a good compromise to get the print log for a subset of the log files, e.g. the last 100?
    Sure, the output for the last 100 files would be plenty.

    Thanks,
    --mark
  • 6. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    greybird Expert
    I meant to add earlier that I strongly suspect this problem would not occur in JE 5, but unfortunately I don't have any specific reasons to give you for that, other than that we've done a large amount of optimization to reduce writing in JE 5, along with more extensive performance testing. But I realize you're looking for a more specific answer.

    --mark
  • 7. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    greybird Expert
    Matt,

    I took a close look at the JE stats you sent. Thanks for creating the easy-to-read spreadsheet!

    I'd like to confirm this by looking at the DbPrintLog -S output, but what I believe is happening is the result of the interaction of log cleaning and checkpointing, due primarily (I think) to the lazy migration setting (EnvironmentConfig.CLEANER_LAZY_MIGRATION) which is true by default and not overridden by your app. As you probably know, lazy migration is strongly recommended to be turned off and is off by default in JE 5. This has been discussed in several threads on the forum, but let me know if you need more info on that.

    The stats don't show anything that is particularly abnormal other than the excessive cleaning and checkpointing activity. Other stats correlate with that. There isn't anything I can see that smacks of a bug, e.g., a cache leak.

    What stands out to me is the size of the checkpoints, based on EnvironmentLastCheckpointStart and EnvironmentLastCheckpointEnd. On node 1 (the misbehaving one) the two checkpoint sizes are around 350 MB, and even on node 2 (the good one) they are around 180 or 200 MB. These are very long checkpoints. Since cleaned file deletion is not done until the end of a checkpoint, there is a long lag between cleaning a file and deleting it.
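
    To put a number on that: the checkpoint start/end stats are LSNs, so the gap between them can be approximated along these lines - a rough sketch only, assuming JE's file-number/offset LSN layout and the ~10MB log files used in this environment:

        // Rough size of a checkpoint given the EnvironmentLastCheckpointStart and
        // EnvironmentLastCheckpointEnd stat values (which are LSNs). Assumes the
        // file number is in the high 32 bits and the file offset in the low 32
        // bits, and that log files are all close to fileMaxBytes in size, so the
        // result is an approximation for monitoring purposes only.
        final class CheckpointSize {
            static long approxBytes(long startLsn, long endLsn, long fileMaxBytes) {
                long startFile = startLsn >>> 32;
                long startOffset = startLsn & 0xffffffffL;
                long endFile = endLsn >>> 32;
                long endOffset = endLsn & 0xffffffffL;
                return (endFile - startFile) * fileMaxBytes + (endOffset - startOffset);
            }
        }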

    What I suspect is that something happened to these older environments -- perhaps database deletions as you mentioned -- that caused a spike in log cleaning. But due to the lazy migration setting and the resulting long checkpoints, cleaning was not able to make fast enough progress to catch up. The cleaner and checkpointer have gotten into a cycle that produces waste as fast as, or faster than, it can be cleaned.

    This is obviously a guess, but because we have seen bad behavior like this before with lazy migration enabled, I think it's the most likely cause.

    Since you're hesitant to upgrade to JE 5 (which would have the most benefit), you may want to first try disabling lazy migration and also increasing the number of cleaner threads to compensate. I would expect to see a noticeable improvement.
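
    As a sketch, those two changes might look something like this (the thread count is an illustrative value, not a recommendation):

        import com.sleepycat.je.EnvironmentConfig;

        final class CleanerTuningSketch {
            static void tune(EnvironmentConfig envConfig) {
                // Turn off lazy migration so LN migration is done by the cleaner
                // threads themselves rather than by the checkpointer/evictor.
                envConfig.setConfigParam("je.cleaner.lazyMigration", "false");
                // Add cleaner threads to compensate (illustrative value).
                envConfig.setConfigParam("je.cleaner.threads", "4");
            }
        }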

    --mark
  • 8. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    greybird Expert
    I just realized you're setting high priority checkpoints to true, which disables lazy migration. I'll have to rethink and take a look at the DbPrintLog output.

    --mark
  • 9. Re: Large Checkpointer behavioral disparity between two "identical" DB envs
    user4547579 Newbie
    Hi Mark,

    [I've just read your next post re the high priority checkpointer as well]

    Comments below:
    greybird wrote:
    > Matt,
    >
    > I took a close look at the JE stats you sent. Thanks for creating the easy-to-read spreadsheet!
    >
    > I'd like to confirm this by looking at the DbPrintLog -S output, but what I believe is happening is the result of the interaction of log cleaning and checkpointing, due primarily (I think) to the lazy migration setting (EnvironmentConfig.CLEANER_LAZY_MIGRATION) which is true by default and not overridden by your app. As you probably know, lazy migration is strongly recommended to be turned off and is off by default in JE 5. This has been discussed in several threads on the forum, but let me know if you need more info on that.
    That's right, I had read about lazy migration already :-) and double-checked that we had it disabled (we had, via the high-priority checkpointer).

    > The stats don't show anything that is particularly abnormal other than the excessive cleaning and checkpointing activity. Other stats correlate with that. There isn't anything I can see that smacks of a bug, e.g., a cache leak.
    >
    > What stands out to me is the size of the checkpoints, based on EnvironmentLastCheckpointStart and EnvironmentLastCheckpointEnd. On node 1 (the misbehaving one) the two checkpoint sizes are around 350 MB, and even on node 2 (the good one) they are around 180 or 200 MB. These are very long checkpoints. Since cleaned file deletion is not done until the end of a checkpoint, there is a long lag between cleaning a file and deleting it.
    That's interesting, because we have default settings for checkpointing - 20MB/30s IIRC - and neither is being adhered to by the looks of things! Node 1 is checkpointing once per minute and node 2 once every 2 minutes. So it looks like the Checkpointer has too much work to do and cannot complete in time or within budget (I'll avoid any obvious jokes about EU governments at this point!).
    > What I suspect is that something happened to these older environments -- perhaps database deletions as you mentioned -- that caused a spike in log cleaning. But due to the lazy migration setting and the resulting long checkpoints, cleaning was not able to make fast enough progress to catch up. The cleaner and checkpointer have gotten into a cycle that produces waste as fast as, or faster than, it can be cleaned.
    >
    > This is obviously a guess, but because we have seen bad behavior like this before with lazy migration enabled, I think it's the most likely cause.
    I don't think the possible DB deletions can be the cause: node 2 is slowly going the same way as node 1, and it has never experienced DB deletions or even upgrades.

    The only time I have seen this sort of thing happen in the past is when the DB cache is extremely small - too small for the cleaner/checkpointer working set. In that case the cleaners fight each other for space and cause their own working set (incl. BINs) to be evicted, only to be immediately dirtied again... an infinite loop results. But that really only happens in extreme cases when the DB cache is <100MB, and it seems mostly independent of the DB size since it is related to the cleaner working set size.
    > Since you're hesitant to upgrade to JE 5 (which would have the most benefit), you may want to first try disabling lazy migration and also increasing the number of cleaner threads to compensate. I would expect to see a noticeable improvement.
    I agree about JE 5, but this is a production system and I don't think the user is willing to make major changes without understanding why. Our next version of OpenDJ does use JE 5 and tests have shown significant improvements. We haven't released it yet (another reason why the user is unwilling to upgrade!).

    When I get the DbPrintLog output I'll let you know.

    Cheers,

    Matt
