This discussion is archived
20 Replies. Latest reply: May 10, 2013 5:54 AM by greybird

Obsolete jdb not being cleaned up

dimo Newbie
Hi,

Setup:
* We are using Oracle NoSQL 1.2.123.
* We have 3 replication groups with 3 replication nodes each.

Problem:
* 2 of the slaves (in 2 different replication groups) occupy much more space in JDB files (10 times more) than all the others. Since these are slaves, writes always go through the master, and all nodes in a replication group (eventually) hold the same data, I assume that this is stale data that has not been cleaned up by the BDB garbage collection (cleaner threads). Unfortunately the logs do not show anything new (since Dec. last year) and the oldest JDB files are from February.

Questions:
* Any ideas what could have gone wrong?
* What can I do to trigger the cleaners to clean up the old data? Is that safe to do in a production environment and without downtime?
* Is it really safe to assume that the current data within a replication group is really the same?

Thank you in advance
Dimo
PS. A thread dump shows 2 cleaner threads that do nothing.
  • 1. Re: Obsolete jdb not being cleaned up
    greybird Expert
    Hello Dimo,

    The first guess is that the problem is the one Linda Lee mentioned to you earlier:
    Re: Replicas fill hard discs and cause a total failure

    We have a fix for this in an upcoming 2.0-based release, and you may want to start planning to upgrade to 2.0.

    However, before coming to any real conclusions I'd like to get more information.

    1) Using the je.jar file in the lib directory of the release package, please run the following command on the node that is experiencing the problem. The <JE_HOME> directory is the one containing the .jdb files. Please post the output.

    java -jar je.jar DbSpace -h <JE_HOME>

    2) Even though there are no new messages in the logs since December, it may be that the problem started at around that time. Please post any WARNING or SEVERE messages, plus any messages containing the word "cleaner" (case insensitive), that are near the end of the log.

    3) Since no .jdb files have been modified, has your application been performing only reads (no writes) since February? How much writing (inserts, deletes, updates) does it do in general, and how much writing do you estimate has occurred since December?
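A minimal sketch of the log filter described in point 2, assuming the node's log is a plain text file. The function name and the `tail` parameter are illustrative only and are not part of any NoSQL DB tooling. It keeps WARNING and SEVERE entries as written, plus any line containing "cleaner" in any letter case, and returns only the most recent matches.

```python
import re
from collections import deque
from typing import Iterable, List

# WARNING and SEVERE are matched as written; "cleaner" is matched in any case.
PATTERN = re.compile(r"WARNING|SEVERE|(?i:cleaner)")


def interesting_lines(lines: Iterable[str], tail: int = 200) -> List[str]:
    """Return the last `tail` lines mentioning WARNING, SEVERE or 'cleaner'."""
    recent = deque(maxlen=tail)  # keeps only the newest `tail` matches
    for line in lines:
        if PATTERN.search(line):
            recent.append(line.rstrip("\n"))
    return list(recent)
```

To use it, open the log with `open(path, errors="replace")` and pass the file object to `interesting_lines`; the path is a placeholder for wherever the node's log lives in your installation.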

    --mark
  • 2. Re: Obsolete jdb not being cleaned up
    dimo Newbie
    Hi Mark,

    We did update the cache and heap sizes to more sane values on all 9 nodes when I posted the message to Linda, and did not change them afterwards. The problem, however, occurred on only 2 nodes.

    1) Is that safe to execute in a running production system? Does it create high load?
    2), 3) - will gather the data and reply again.

    Thank you,
    Dimo
  • 3. Re: Obsolete jdb not being cleaned up
    greybird Expert
    1) Is that safe to execute in a running production system? Does it create high load?
    It does add some load, so if you're concerned about that then you may want to make a copy of the logs on a non-production system and run the utility there.
    --mark
  • 4. Re: Obsolete jdb not being cleaned up
    dimo Newbie
    Hi,

    The results from the call are:
    cut
    [user@host ~]$ java -jar /ora/kv-1.2.123/lib/je.jar DbSpace -h /ora/NODE06/STORE/STORE/sn6/rg2-rn3/env
    com.sleepycat.je.EnvironmentFailureException: (JE 5.0.36) /ora/NODE06/STORE/STORE/sn6/rg2-rn3/env last LSN=0x18/0x52af862 LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Environment is invalid and must be closed.
    cut
    That does not sound very good.

    How can this happen and how can we fix it?

    Thank you
    Dimo
  • 5. Re: Obsolete jdb not being cleaned up
    greybird Expert
    Please post the complete stack trace.
    --mark
  • 6. Re: Obsolete jdb not being cleaned up
    greybird Expert
    I'm sorry, I think I gave you the wrong command to run. DbSpace is probably not able to find a key comparator class that is in the kvstore.jar.

    The correct command is:

    java -cp <KV_HOME>/lib/kvstore.jar com.sleepycat.je.util.DbSpace -h <JE_HOME>

    kvstore.jar refers to the other jars in its directory, so this will include the je.jar in the classpath.

    --mark
  • 7. Re: Obsolete jdb not being cleaned up
    dimo Newbie
    Hi Mark,

    This fails too, with the following message: "Unknown command: DbSpace"

    Cheers,
    Dimo
  • 8. Re: Obsolete jdb not being cleaned up
    greybird Expert
    I apologize. I have corrected the command syntax in the message above.
    --mark
  • 9. Re: Obsolete jdb not being cleaned up
    dimo Newbie
    Hi Mark,
    great, that worked. The output is:

    File      Size (KB)  % Used
    --------  ---------  ------
    00000001    1048575       8
    00000002    1048575      26
    00000003    1048575      36
    00000004    1048575      41
    00000005    1048575      41
    00000006    1048575      36
    00000007    1048575      36
    00000008    1048575      35
    00000009    1048575      42
    0000000a    1048574      35
    0000000b    1048575      41
    0000000c    1048575      37
    0000000d    1048575      39
    0000000e    1048575      41
    0000000f    1048575      51
    00000010    1048575      48
    00000011    1048575      49
    00000012    1048575      54
    00000013    1048575      55
    00000014    1048575      59
    00000015    1048575      50
    00000016    1048575      54
    00000017    1048575      55
    00000018    1048575      51
    00000019    1048575      55
    0000001a    1048575      56
    0000001b    1048575      69
    0000001c    1048575      66
    0000001d    1048575      66
    0000001e    1048575      66
    0000001f    1048575      67
    00000020    1048575      94
    00000021     450868      93
    TOTALS     34005291      49
    (average uncounted LN size, corrected: 171.35216 estimated: 169.26071)

    The output from the "healthy" master of the same replication group:
    File      Size (KB)  % Used
    --------  ---------  ------
    0000001b    1048575      31
    0000001c    1048575      23
    0000001d    1048575      25
    0000001e    1048575      25
    0000001f    1048575      33
    00000020    1048575      81
    00000021     578968      93
    TOTALS      6870422      41
    (average uncounted LN size, corrected: 441.97516 estimated: 441.9595)
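For readers who want to compare two DbSpace reports like the ones above, here is a rough sketch that parses the per-file rows and the TOTALS line, then estimates how much space is obsolete. The function names are hypothetical and the parser assumes only the three-column layout shown in this thread. The estimate is approximate because DbSpace prints rounded percentages. The sample text is the master's report as posted.

```python
import re
from typing import Dict, List, Optional, Tuple

# A data row: 8 hex digits (file number), size in KB, percent used.
ROW = re.compile(r"^\s*([0-9a-f]{8})\s+(\d+)\s+(\d+)\s*$")
TOTALS = re.compile(r"^\s*TOTALS\s+(\d+)\s+(\d+)\s*$")

MASTER_REPORT = """\
File Size (KB) % Used
-------- --------- ------
0000001b 1048575 31
0000001c 1048575 23
0000001d 1048575 25
0000001e 1048575 25
0000001f 1048575 33
00000020 1048575 81
00000021 578968 93
TOTALS 6870422 41
"""


def parse_dbspace(text: str) -> Dict[str, object]:
    """Parse DbSpace text into per-file rows and the TOTALS pair."""
    files: List[Tuple[str, int, int]] = []
    totals: Optional[Tuple[int, int]] = None
    for line in text.splitlines():
        row = ROW.match(line)
        if row:
            files.append((row.group(1), int(row.group(2)), int(row.group(3))))
            continue
        tot = TOTALS.match(line)
        if tot:
            totals = (int(tot.group(1)), int(tot.group(2)))
    return {"files": files, "totals": totals}


def obsolete_kb(report: Dict[str, object]) -> float:
    """Estimate obsolete space in KB as size * (100 - used) / 100 per file."""
    return sum(size * (100 - used) / 100 for _, size, used in report["files"])
```

Calling `parse_dbspace` on each node's output and comparing the `totals` entries makes the gap explicit: the replica's TOTALS line reports 34005291 KB against 6870422 KB on the master, roughly five times as much, and the replica still holds files 00000001 through 0000001a that the master has already deleted.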

    What does the output tell us?

    Cheers,
    Dimo

    Edited by: dimo on May 2, 2013 4:01 PM
  • 10. Re: Obsolete jdb not being cleaned up
    greybird Expert
    Dimo,

    Thanks for the output. I can tell from the output (in particular, the difference in the "average uncounted LN size") that this is a known problem with NoSQL DB 1.2. There are two things that should be done about this.

    1) The simplest and fastest way to correct the replica node is to restore it from the master node. We will send you instructions for doing this later today.

    2) To prevent this problem from happening in the future, you should upgrade to NoSQL DB 2.0 (latest release) as soon as possible. Several fixes are included in 2.0 that help to avoid this problem.

    In addition, if you have a commercial license for NoSQL DB you should contact us through the official support channel.

    I apologize for the bug. One of the reasons we didn't catch this earlier is that it happens primarily with applications that do little or no writing, once the data set has been created. Most of our testing has been oriented around applications that perform continuous writing.

    --mark
  • 11. Re: Obsolete jdb not being cleaned up
    greybird Expert
    1) The simplest and fastest way to correct the replica node is to restore it from the master node. We will send you instructions for doing this later today.
    Here are directions for refreshing the data storage files (.jdb files) on a target node. NoSQL DB will automatically refresh the storage files from another node, after we manually stop the target node, delete its storage files, and finally restart it, as described below. Thanks to Linda Lee for these directions.

    First, be sure to make a backup.

    Suppose you want to remove the storage files from rg1-rn3 and make it refresh its files from rg1-rn1. First, check where the storage files for the target Replication Node are located, using the show topology command in the Admin CLI. Start the Admin CLI this way:
        java -jar KVHOME/lib/kvstore.jar runadmin -host <host> -port <port>
    Find the directory containing the target Replication Node's files.
        kv-> show topology -verbose
        store=mystore  numPartitions=100 sequence=108
          dc=[dc1] name=MyDC repFactor=3
    
          sn=[sn1]  dc=dc1 localhost:13100 capacity=1 RUNNING
            [rg1-rn1] RUNNING  c:/linda/work/smoke/KVRT1/dirB
                         single-op avg latency=0.0 ms   multi-op avg latency=0.67391676 ms
          sn=[sn2]  dc=dc1 localhost:13200 capacity=1 RUNNING
            [rg1-rn2] RUNNING  c:/linda/work/smoke/KVRT2/dirA
                      No performance info available
          sn=[sn3]  dc=dc1 localhost:13300 capacity=1 RUNNING
            [rg1-rn3] RUNNING  c:/linda/work/smoke/KVRT3/dirA
                         single-op avg latency=0.0 ms   multi-op avg latency=0.53694165 ms
    
          shard=[rg1] num partitions=100
            [rg1-rn1] sn=sn1 haPort=localhost:13111
            [rg1-rn2] sn=sn2 haPort=localhost:13210
            [rg1-rn3] sn=sn3 haPort=localhost:13310
            partitions=1-100
    In this example, rg1-rn3's storage is located in
        c:/linda/work/smoke/KVRT3/dirA
    Stop the target service using the stop-service command
        kv-> plan stop-service -service rg1-rn3 -wait
    In another command shell, remove the files for the target Replication Node
        rm c:/linda/work/smoke/KVRT3/dirA/rg1-rn3/env/*.jdb
    In the Admin CLI, restart the service
        kv-> plan start-service -service rg1-rn3 -wait
    The service will restart, and will populate its missing files from one of the other two nodes in the shard. You can use the "verify" or the "show topology" command to check on the status of the store.
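As a more cautious variation on the `rm` step above, the sketch below moves the .jdb files into a backup directory instead of deleting them. This is an assumption layered on top of the directions, not part of them: the helper name and both directory arguments are hypothetical, and it must only be run after the target Replication Node has been stopped.

```python
import shutil
from pathlib import Path
from typing import List


def set_aside_jdb_files(env_dir: str, backup_dir: str) -> List[Path]:
    """Move every *.jdb file from env_dir into backup_dir.

    Only for use while the Replication Node is stopped. Other files in
    env_dir are left untouched.
    """
    env = Path(env_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)

    # Plan all moves first so nothing is moved if any target already exists.
    planned = [(jdb, backup / jdb.name) for jdb in sorted(env.glob("*.jdb"))]
    for _, target in planned:
        if target.exists():
            raise FileExistsError(f"{target} already exists; use an empty backup directory")

    moved: List[Path] = []
    for source, target in planned:
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved
```

Once the node has been restarted and has refreshed its files from the shard, and the store has been checked, the backup directory can be removed to reclaim the space.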

    --mark
  • 12. Re: Obsolete jdb not being cleaned up
    966361 Newbie
    Hi,
    we tried to do the fix described above, unfortunately it failed with:

    kv-> plan stop-service -service rg1-rn3 -wait
    Unknown plan subcommand stop-service.


    Shouldn't it go something like

    kv-> plan -execute stop-repnodes ... ?

    Best Regards
    Christian
  • 13. Re: Obsolete jdb not being cleaned up
    966361 Newbie
    Hi,

    plan -execute stop-repnodes 1,3 worked in our test environment, we only had one jdb file there.
    The .jdb file was deleted, and after
    plan -execute start-repnodes 1,3 it was rebuilt from the master node.

    Unfortunately the .jdb file on the master node is considerably larger than the replicated .jdb file on the rg1-rn3 node.

    Best Regards
    Christian
  • 14. Re: Obsolete jdb not being cleaned up
    greybird Expert
    It sounds like you executed the procedure designed to solve a problem on a replica, when you didn't have the problem in the first place. What motivated you to execute the procedure? If you only have one .jdb file, log cleaning cannot take place. Log cleaning (deletion of .jdb files) can take place only after several .jdb files are present.

    --mark