This discussion is archived
1 2 Previous Next 16 Replies Latest reply: Aug 7, 2012 3:37 PM by 909806 RSS

Reducing shutdown time

vinothchandar Newbie
Currently Being Moderated
Hi,

We experience 5 minute shutdown times on 10GB BDB (version 4.1.17), and trying to see if we can reduce this time to under 2 minutes
I have looked around and did not find anything specifically of help. So questions along this line

-- What exactly causes this delay during shutdown? I assume its mainly writing back all the dirtied nodes back to disk.
-- Does some log cleaning also happen to improve space utilization? In most cases, we bring up environment right away during a server bounce. Hence, such an overhead is actually not helping.

In other words, we just want to flush all dirtied nodes in memory changes to disk, during shutdown and be able to get back up again fast. Is there some option/configuration we can leverage?

Thanks
Vinoth
  • 1. Re: Reducing shutdown time
    greybird Expert
    Currently Being Moderated
    Hi Vinoth.

    Please take some thread dumps during the shutdown, to see where the bulk of the 5 min is being spent. With that information, I can give you better advice. It may not be log cleaning, it may be the final checkpoint.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  • 2. Re: Reducing shutdown time
    vinothchandar Newbie
    Currently Being Moderated
    Sure will do and get back to you.
  • 3. Re: Reducing shutdown time
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    Among threads that are RUNNABLE, I see a lot of

    java.lang.Thread.State: RUNNABLE
    at java.util.HashMap.removeEntryForKey(HashMap.java:548)
    at java.util.HashMap$HashIterator.remove(HashMap.java:814)
    at com.sleepycat.je.recovery.DirtyINMap.removeNextNode(DirtyINMap.java:305)
    at com.sleepycat.je.recovery.Checkpointer.flushDirtyNodes(Checkpointer.java:685)
    at com.sleepycat.je.recovery.Checkpointer.doCheckpoint(Checkpointer.java:476)
    - locked <0x00000003111b8e90> (a com.sleepycat.je.recovery.Checkpointer)
    at com.sleepycat.je.dbi.EnvironmentImpl.invokeCheckpoint(EnvironmentImpl.java:1778)
    at com.sleepycat.je.dbi.EnvironmentImpl.doClose(EnvironmentImpl.java:1527)
    - locked <0x00000002fd44fe00> (a com.sleepycat.je.dbi.EnvironmentImpl)
    at com.sleepycat.je.dbi.DbEnvPool.closeEnvironment(DbEnvPool.java:299

    All the sleepycat related entries are doCheckpoint() children. Looks like its the final checkpoint that is hurting. Can you please shed some light on this?
  • 4. Re: Reducing shutdown time
    greybird Expert
    Currently Being Moderated
    Hi Vinoth,

    I can believe that the checkpoint may take a while. But it is strange if you see that exact stack trace often -- removing an item from the hash map in the dirty set -- but maybe that's not what you mean?

    The last checkpoint won't be very different than other checkpoints, so the thing to look at is why checkpoints are taking a long time in general.

    Are you using cleaner lazy migration (which is on by default in JE 4.1 and earlier)? If so, that's the first thing to do: turn it off. With lazy migration, much of the work of log cleaning is done by the checkpoint.

    Also please look at your configured checkpoint interval (env config), and the actual checkpoint interval -- "DbPrintLog -S" will list the checkpoint intervals, and you can use the "-s 0xFILENUMBER" option to only look at the last 10 or 20 files (to speed up the run).

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 5. Re: Reducing shutdown time
    greybird Expert
    Currently Being Moderated
    Vinoth, I'm very interested in whether you often see the DirtyINMap.removeNextNode method in your thread dumps. Sam pointed out an efficiency in that method, and I wonder if it is having an impact.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                               
  • 6. Re: Reducing shutdown time
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    Yes actually I see it as often as removeEntryForKey.

    This is a problem with our old je 4.0.92 servers, where the cleaner_lazy_migration did not exist I think. What is the behaviour on those versions? I will be upgrading our server to 4.1.17 shortly and turning off lazy_cleaner_migration though.

    Thanks
    Vinoth
  • 7. Re: Reducing shutdown time
    greybird Expert
    Currently Being Moderated
    Yes actually I see it as often as removeEntryForKey.
    Sorry, my question wasn't clear. If you take a number of thread dumps, how often do you see this method on the stack (including methods it calls of course)? Earlier you implied that it was often, but you didn't actually say that, or how often. The majority of the time? Also note that this is not related to cleaner lazy migration.

    The cleaner lazy migration feature did exist in JE 4.0.92, but was always on (there was no config option). To be more accurate, it was always on when high priority checkpointing was off, which is the default. The lazy migration config option was added in 4.0.117.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 8. Re: Reducing shutdown time
    vinothchandar Newbie
    Currently Being Moderated
    Hi mark,

    I see this 18/21 times in the threaddump

    at java.util.HashMap.removeEntryForKey(HashMap.java:548)
    at java.util.HashMap$HashIterator.remove(HashMap.java:814)
    at com.sleepycat.je.recovery.DirtyINMap.removeNextNode(DirtyINMap.java:305)
    at com.sleepycat.je.recovery.Checkpointer.flushDirtyNodes(Checkpointer.java:685)
    at com.sleepycat.je.recovery.Checkpointer.doCheckpoint(Checkpointer.java:476)
    - locked <0x00000003111b8e90> (a com.sleepycat.je.recovery.Checkpointer)
    at com.sleepycat.je.dbi.EnvironmentImpl.invokeCheckpoint(EnvironmentImpl.java:1778)
    at com.sleepycat.je.dbi.EnvironmentImpl.doClose(EnvironmentImpl.java:1527)
    - locked <0x00000002fd44fe00> (a com.sleepycat.je.dbi.EnvironmentImpl)
    at com.sleepycat.je.dbi.DbEnvPool.closeEnvironment(DbEnvPool.java:299)
    - locked <0x00000002fae2e518> (a com.sleepycat.je.dbi.DbEnvPool)
    at com.sleepycat.je.dbi.EnvironmentImpl.close(EnvironmentImpl.java:1426)
    at com.sleepycat.je.Environment.closeInternal(Environment.java:461)
    - locked <0x00000002fb070d30> (a com.sleepycat.je.Environment)
    at com.sleepycat.je.Environment.close(Environment.java:322)
    - locked <0x00000002fb070d30> (a com.sleepycat.je.Environment)
    at voldemort.store.bdb.BdbStorageConfiguration.close(BdbStorageConfiguration.java:239)
    - locked <0x00000002fae2e530> (a java.lang.Object)
    at voldemort.server.storage.StorageService.stopInner(StorageService.java:649)
    at voldemort.server.AbstractService.stop(AbstractService.java:73)
    - locked <0x00000002fae37728> (a voldemort.server.storage.StorageService)
    at voldemort.server.VoldemortServer.stopInner(VoldemortServer.java:270)
    at voldemort.server.AbstractService.stop(AbstractService.java:73)

    Do you expect cleaner lazy migration=off will significantly improve the shutdown time?
  • 9. Re: Reducing shutdown time
    greybird Expert
    Currently Being Moderated
    I see this 18/21 times in the threaddump
    Thanks. I'll look into improving that method.
    Do you expect cleaner lazy migration=off will significantly improve the shutdown time?
    Yes. I'm not sure how to say that emphatically enough. Lazy migration was useful in the distant past, but slows down checkpoints significantly, which has all kinds of negative side effects. It should no longer be used. I wish we changed the config setting default to false much sooner than we did (in JE 5).

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 10. Re: Reducing shutdown time
    vinothchandar Newbie
    Currently Being Moderated
    Thanks mark. Will try that.

    What you say totally makes sense. I already had an intuition that it was the lazy cleaning. But just wanted to hear from the Pro :) .

    Thanks for all your tireless help!

    Cheers
    Vinoth
  • 11. Re: Reducing shutdown time
    909806 Newbie
    Currently Being Moderated
    I've noticed this frequently as well (DirtyINMap.removeNextNode() in stacktraces during checkpointing), and I have long checkpointing times, so I'd be very interested in any fix you come up with for this. (This is under JE 4.1.17.)

    I also saw in the implementation that DirtyINMap.removeNextNode() allocates an iterator for one-time use on each call, but I don't think that's the inefficiency, since for me these stacktraces are always inside java.util.HashMap$HashIterator.remove(). That implies to me that we are in a case in which not much else is happening during the loop, which I think means that we are failing the currentLevelVal <= maxFlushLevel test in Checkpointer.flushDirtyNodes(). I wonder if that test could somehow be done above the removeNextNode call. Just a thought.

    Also, just to clarify: Turning off cleaner lazy migration helps with checkpointing times, but it's a separate issue from this, right? So in the case of lazy migration slowing down checkpointing, is there a distinctive way that would show up in stacktraces?

    Thanks much.
  • 12. Re: Reducing shutdown time
    greybird Expert
    Currently Being Moderated
    Thanks for the additional info on the HashMap issue. I'll be looking into that next week and will report back here.

    Right, they're completely separate issues.

    With lazy migration, you'll often see Cleaner.lazyMigrateLNs on the stack, called in the checkpointer thread.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 13. Re: Reducing shutdown time
    909806 Newbie
    Currently Being Moderated
    One more observation (since I just had the opportunity to watch this again carefully): During the checkpointing, though the stacktraces almost always lead to this HashMap method, it is managing to write to the filesystem the whole time (at a gradually decreasing rate). That is, there are no noticeable periods during which it stops writing, via monitoring of I/O activity at the OS level. That seems to imply that it's not iterating through an entire level with no work to do. Thought I'd pass this along, in case it tells you something.
  • 14. Re: Reducing shutdown time
    greybird Expert
    Currently Being Moderated
    Jeffstyr,

    Thanks. We are testing a fix (Charles Lamb is doing this) and will be getting back to you shortly. The problem is an inefficiency with the removal of elements from a large HashMap and we believe the solution is simply to replace the HashMap with a TreeMap. If you'd like to try that out yourself, as an additional test to confirm this, you can do the following (this is from Charlie):

    Go to com.sleepycat.je.recovery.DirtyINMap.addIN, and change the assignment to nodeMap from an HashMap to a TreeMap (i.e. s/Hash/Tree/).

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
1 2 Previous Next

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points