This discussion is archived
10 Replies Latest reply: Sep 7, 2012 7:20 AM by greybird

Cleaner apparently doing opposite of intention

943066 Newbie
Hi all,

I'm currently observing some very strange cleaner behaviour with BDB JE 5.0.55. Basically, my environment size has recently been getting unexpectedly large. When I shut down my DB to run some diagnostics, it shrank from ~80GB to 40GB upon graceful close.

I took a copy to diagnose (I suspected something internal may have been going on); however, when I restarted my server (and reopened the env), I noticed there was a sizeable cleaner backlog.

Upon cleaning, the log steadily got larger (rather than smaller) as the cleaner backlog approached 0. The file deletion backlog was also 0; see the grepped output below:

-----
     cleanerBackLog=695
     totalLogSize=44,222,447,351
     cleanerBackLog=692
     totalLogSize=44,007,839,334
     cleanerBackLog=677
     totalLogSize=44,161,745,419
     cleanerBackLog=683
     totalLogSize=44,489,202,186
     cleanerBackLog=663
     totalLogSize=44,822,143,619
     cleanerBackLog=637
     totalLogSize=44,916,840,541
     cleanerBackLog=604
     totalLogSize=44,548,090,588
     cleanerBackLog=583
     totalLogSize=45,535,576,874
     cleanerBackLog=559
     totalLogSize=46,309,369,095
     cleanerBackLog=544
     totalLogSize=47,198,028,965
     cleanerBackLog=536
     totalLogSize=48,132,504,517
     cleanerBackLog=521
     totalLogSize=48,817,762,947
     cleanerBackLog=508
     totalLogSize=49,725,425,757
     cleanerBackLog=492
     totalLogSize=50,570,879,399
     cleanerBackLog=476
     totalLogSize=51,465,010,051
     cleanerBackLog=466
     totalLogSize=52,347,693,773
     cleanerBackLog=460
     totalLogSize=53,143,565,971
     cleanerBackLog=450
     totalLogSize=54,061,257,446
     cleanerBackLog=440
     totalLogSize=54,973,962,944
     cleanerBackLog=437
     totalLogSize=55,757,981,268
     cleanerBackLog=429
     totalLogSize=56,508,152,158
     cleanerBackLog=425
     totalLogSize=57,289,083,547
     cleanerBackLog=417
     totalLogSize=57,900,740,776
     cleanerBackLog=414
     totalLogSize=58,712,600,029
     cleanerBackLog=411
     totalLogSize=59,522,908,676
     cleanerBackLog=406
     totalLogSize=60,472,507,769
     cleanerBackLog=401
     totalLogSize=61,418,812,558
     cleanerBackLog=402
     totalLogSize=62,334,769,591
     cleanerBackLog=397
     totalLogSize=63,138,168,923
     cleanerBackLog=402
     totalLogSize=64,055,868,441
     cleanerBackLog=398
     totalLogSize=64,982,882,494
     cleanerBackLog=401
     totalLogSize=65,904,303,917
     cleanerBackLog=395
     totalLogSize=66,725,704,868
     cleanerBackLog=392
     totalLogSize=67,603,991,723
     cleanerBackLog=397
     totalLogSize=68,538,767,238
     cleanerBackLog=395
     totalLogSize=69,472,262,950
     cleanerBackLog=479
     totalLogSize=70,383,584,273
     cleanerBackLog=444
     totalLogSize=71,247,189,752
     cleanerBackLog=410
     totalLogSize=72,191,040,670
     cleanerBackLog=372
     totalLogSize=73,101,233,310
     cleanerBackLog=340
     totalLogSize=73,912,683,883
     cleanerBackLog=305
     totalLogSize=74,792,141,198
     cleanerBackLog=270
     totalLogSize=75,750,888,363
     cleanerBackLog=237
     totalLogSize=76,685,553,473
     cleanerBackLog=205
     totalLogSize=77,493,814,220
     cleanerBackLog=172
     totalLogSize=78,358,968,699
     cleanerBackLog=142
     totalLogSize=79,191,961,498
     cleanerBackLog=109
     totalLogSize=80,082,532,825
     cleanerBackLog=82
     totalLogSize=80,954,944,721
     cleanerBackLog=61
     totalLogSize=81,895,138,184
     cleanerBackLog=43
     totalLogSize=82,921,752,555
     cleanerBackLog=28
     totalLogSize=83,923,249,685
     cleanerBackLog=17
     totalLogSize=84,879,197,438
     cleanerBackLog=5
     totalLogSize=85,871,665,824
     cleanerBackLog=0
     totalLogSize=86,933,060,983
     cleanerBackLog=0
     totalLogSize=87,074,231,796
     cleanerBackLog=0
     totalLogSize=87,074,231,796
     cleanerBackLog=0
-----
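For context, those figures come from periodically dumping environment stats, roughly as follows (a sketch only: it assumes an open Environment handle, and uses the stat accessor names from the JE 5 javadoc):

```java
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentStats;
import com.sleepycat.je.StatsConfig;

// Poll cleaner backlog and total log size; `env` is an open Environment.
void dumpCleanerStats(Environment env) throws InterruptedException {
    StatsConfig sc = new StatsConfig();
    sc.setClear(false);                 // cumulative stats, don't reset counters
    while (true) {
        EnvironmentStats stats = env.getStats(sc);
        System.out.println("cleanerBackLog=" + stats.getCleanerBacklog());
        System.out.println("totalLogSize=" + stats.getTotalLogSize());
        Thread.sleep(60_000);           // sample once a minute
    }
}
```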

I would say that they're independent events (i.e. something else is causing the log to grow quickly); however, if I close the DB, it instantly shrinks back down to ~42GB, then slowly grows again as I reopen it and the cleaner backlog approaches 0.

I suspect there may be something internally wrong with my database, but am unsure which direction to head diagnosis-wise to prove/disprove this.

Any thoughts/guidance welcome -

Cheers,

fb.
  • 1. Re: Cleaner apparently doing opposite of intention
    943066 Newbie
    Just adding to this, I noticed these are cropping up:
    -----
    com.sleepycat.je.SecondaryIntegrityException: (JE 5.0.55) Secondary refers to a missing key in the primary database
    at com.sleepycat.je.Database.secondaryRefersToMissingPrimaryKey(Database.java:2164) ~[xxxx-1.0-SNAPSHOT.jar:na]
    at com.sleepycat.je.Cursor.readPrimaryAfterGet(Cursor.java:3726) ~[xxxx-1.0-SNAPSHOT.jar:na]
    at com.sleepycat.je.SecondaryCursor.readPrimaryAfterGet(SecondaryCursor.java:1470) ~[xxxx-1.0-SNAPSHOT.jar:na]
    at com.sleepycat.je.SecondaryCursor.getCurrentInternal(SecondaryCursor.java:1343) ~[xxxx-1.0-SNAPSHOT.jar:na]
    at com.sleepycat.je.SecondaryCursor.getCurrent(SecondaryCursor.java:314) ~[xxxx-1.0-SNAPSHOT.jar:na]
    at com.sleepycat.je.SecondaryCursor.getCurrent(SecondaryCursor.java:259) ~[xxxx-1.0-SNAPSHOT.jar:na]
    -----

    Slightly concerning considering I have transactions on (so this theoretically shouldn't happen) - I'm guessing my DB has been left in some sort of inconsistent state. Running com.sleepycat.je.util.DbVerifyLog doesn't output anything though.
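    For reference, I invoked the verifier roughly like this (jar name and paths are placeholders). Worth noting that DbVerifyLog only checksum-verifies the log files, so a logical primary/secondary mismatch wouldn't necessarily show up; DbVerify can additionally check a database's btree structure:

    ```shell
    # Checksum-verify all log files in the environment (physical corruption only).
    java -cp je-5.0.55.jar com.sleepycat.je.util.DbVerifyLog -h /path/to/env

    # Structurally verify one database's btree ("myDatabase" is a placeholder name).
    java -cp je-5.0.55.jar com.sleepycat.je.util.DbVerify -h /path/to/env -s myDatabase
    ```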

    Edited by: fb on Aug 13, 2012 1:34 AM
  • 2. Re: Cleaner apparently doing opposite of intention
    greybird Expert
    fb,

    It doesn't make sense that Environment.close would cause disk space reclamation, if you are doing regular checkpoints. And if the file deletion backlog is zero, then I assume you are doing regular checkpoints.

    So I don't know what's going on. The logging of cleaner information added in JE 5.0.58 may give us more information.

    --mark
  • 3. Re: Cleaner apparently doing opposite of intention
    909806 Newbie
    Actually, fileDeletionBacklog isn't incremented until the end of the checkpoint, inside FileSelector.updateFilesAtCheckpointEnd(), which is called immediately before Cleaner.deleteSafeToDeleteFiles(). So it is always zero unless you are in the middle of the actual deletions (I see this in the code, and I've observed it in practice). So not running the checkpointer thread at all would produce the symptoms described.
  • 4. Re: Cleaner apparently doing opposite of intention
    greybird Expert
    Actually, fileDeletionBacklog isn't incremented until the end of the checkpoint
    You're right, thank you for correcting me on this!
    --mark
  • 5. Re: Cleaner apparently doing opposite of intention
    greybird Expert
    fb,

    This doesn't seem related to the cleaner problem to me.

    Are you certain that you're calling DatabaseConfig.setTransactional(true) for the database, and that no data was written to the database in the past when DatabaseConfig.setTransactional(true) was not called?

    And, just to confirm, you are not using a custom key comparator of any kind, and no data was written to the database in the past using a custom key comparator, correct?

    If you're certain of both of these things, the only explanation I know of is that an unknown problem occurred earlier and the residual side effect is this corruption (which is what you're saying also). In this case, there really should have been an earlier exception. Can you find any earlier exceptions in your logs or the je.info logs?

    --mark
  • 6. Re: Cleaner apparently doing opposite of intention
    943066 Newbie
    The database has definitely only ever operated with transactions on.

    Unfortunately, I don't have logs going back far enough to be useful to discover a potential past exception.

    I was able to reproduce the above reliably on the given database.

    I am using a custom key comparator (and serializer), however I believe it to be stable (though have been burned here before).

    I have since exported the database to a serialized form and rebuilt it from that. Interestingly, the recreated version with the same data came out at approximately 60% of its previous size.

    The database is very volatile and is entirely rewritten at least every 24 hours - I'm going to keep an eye on it and see if the size remains relatively stable.
  • 7. Re: Cleaner apparently doing opposite of intention
    943066 Newbie
    Just a follow up to this -

    I ended up having to abandon the aforementioned problematic DB (some 4 weeks ago).

    Interestingly, yesterday, I encountered exactly the same problem with an entirely fresh DB.

    It appears it's possible to get a transactional BDB JE database into an inconsistent state by shutting down non-gracefully under heavy write load. Once in this state, it never recovers (despite RecoveryManager doing its thing on restart).

    I've only seen this occur on DBs with secondary indexes, after which you regularly get:
    -----
    com.sleepycat.je.SecondaryIntegrityException: (JE 5.0.55) Secondary refers to a missing key in the primary database
    -----

    I haven't tried to replicate in a completely isolated way, but it does make for some concerning behaviour.
  • 8. Re: Cleaner apparently doing opposite of intention
    943066 Newbie
    Further to this - the problem seems to occur after recovery runs, i.e.:
    -----
         at java.util.zip.Adler32.updateBytes(Native Method)
         at java.util.zip.Adler32.update(Adler32.java:49)
         at com.sleepycat.je.log.ChecksumValidator.update(ChecksumValidator.java:58)
         at com.sleepycat.je.log.ChecksumValidator.update(ChecksumValidator.java:54)
         at com.sleepycat.je.log.FileReader.startChecksum(FileReader.java:666)
         at com.sleepycat.je.log.FileReader.readNextEntryAllowExceptions(FileReader.java:302)
         at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:229)
         at com.sleepycat.je.recovery.RecoveryManager.redoLNs(RecoveryManager.java:1284)
         at com.sleepycat.je.recovery.RecoveryManager.buildTree(RecoveryManager.java:602)
         at com.sleepycat.je.recovery.RecoveryManager.recover(RecoveryManager.java:198)
         at com.sleepycat.je.dbi.EnvironmentImpl.finishInit(EnvironmentImpl.java:610)
         - locked <0x000000022fb3e8f8> (a com.sleepycat.je.dbi.EnvironmentImpl)
         at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:210)
         at com.sleepycat.je.Environment.makeEnvironmentImpl(Environment.java:246)
         at com.sleepycat.je.Environment.<init>(Environment.java:227)
         at com.sleepycat.je.Environment.<init>(Environment.java:170)
    -----

    It seems to be very easy to close the DB non-gracefully, which in turn causes very long recovery times - up to 10 minutes for me on an SSD, which is undesirable on a live system.

    I've put a fair amount of effort into ensuring that both the DBs and the env get close() called appropriately (with cursors and txns completed), but it's possible I'm missing something.

    Is there anything else particular that needs to be done to ensure a clean close?
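    For reference, my shutdown path currently looks roughly like this (names are illustrative; the forced checkpoint before close is a belt-and-braces step I've been trying):

    ```java
    import com.sleepycat.je.CheckpointConfig;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.SecondaryDatabase;

    // Illustrative shutdown order; handles were opened elsewhere. Application
    // threads are stopped, cursors closed, and txns committed/aborted before this.
    void shutdown(Environment env, Database primaryDb, SecondaryDatabase secondaryDb) {
        secondaryDb.close();      // close secondaries before their primary
        primaryDb.close();

        CheckpointConfig cc = new CheckpointConfig();
        cc.setForce(true);
        env.checkpoint(cc);       // force a final checkpoint so recovery has little to redo

        env.close();              // close the environment last
    }
    ```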
  • 9. Re: Cleaner apparently doing opposite of intention
    greybird Expert
    I haven't tried to replicate in a completely isolated way
    Since we haven't seen this problem elsewhere, the fastest way to resolve it is for you to reproduce it in an isolated test, and we'll use that test to analyze the problem.

    --mark
  • 10. Re: Cleaner apparently doing opposite of intention
    greybird Expert
    It seems to be very easy to close the DB non-gracefully, which in turn causes very long recovery times - up to 10 minutes for me on an SSD, which is undesirable on a live system.
    Long recovery times are due to long checkpoint intervals. Decreasing the checkpoint interval is the way to address this -- see EnvironmentConfig.CHECKPOINTER_BYTES_INTERVAL. EnvironmentStats.getNCheckpoints() gives an indication that checkpoints are happening. Doing a "DbPrintLog -S" will show the most recent checkpoint intervals at the end of the report.
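    For example, in je.properties in the environment home (the value here is illustrative; the default is 20000000 bytes, i.e. 20 MB):

    ```
    # Run a checkpoint after less log has been written, so crash recovery
    # has less to redo. Default je.checkpointer.bytesInterval is 20000000.
    je.checkpointer.bytesInterval=10000000
    ```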

    --mark
