20 Replies Latest reply: May 10, 2013 7:54 AM by greybird

    Obsolete jdb not being cleaned up

    dimo
      Hi,

      Setup:
      * We are using Oracle NoSQL 1.2.123.
      * We have 3 replication groups with 3 replication nodes each.

      Problem:
      * 2 of the slaves (in 2 different replication groups) occupy much more space in JDB files (10 times more) than all the others. Since these are slaves, writes always go through the master, and all nodes in a replication group (eventually) hold the same data, I assume this is stale data that has not been cleaned up by the BDB garbage collection (cleaner threads). Unfortunately, the logs show nothing new (since Dec. last year), and the oldest JDB files are from February.

      Questions:
      * Any ideas what could have gone wrong?
      * What can I do to trigger the cleaners to clean up the old data? Is that safe to do in a production environment and without downtime?
      * Is it really safe to assume that the current data within a replication group is really the same?

      Thank you in advance
      Dimo
      PS. A thread dump shows 2 cleaner threads that do nothing.
        • 1. Re: Obsolete jdb not being cleaned up
          greybird
          Hello Dimo,

          The first guess is that the problem is the one Linda Lee mentioned to you earlier:
          Re: Replicas fill hard discs and cause a total failure

          We have a fix for this in an upcoming 2.0-based release, and you may want to start planning to upgrade to 2.0.

          However, before coming to any real conclusions I'd like to get more information.

          1) Using the je.jar file in the lib directory of the release package, please run the following command on the node that is experiencing the problem. The <JE_HOME> directory is the one containing the .jdb files. Please post the output.

          java -jar je.jar DbSpace -h <JE_HOME>

          2) Even though there are no new messages in the logs since December, it may be that the problem started at around that time. Please post any WARNING or SEVERE messages, plus any messages containing the word "cleaner" (case insensitive), that are near the end of the log.

          3) Since no .jdb files have been modified, has your application been performing only reads (no writes) since February? How much writing (inserts, deletes, updates) does it do in general, and how much writing do you estimate has occurred since December?
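
The filtering asked for in step 2 could be sketched in Python as follows. This is only an illustration: the function name `interesting_lines` and the sample log lines are made up, and real NoSQL DB log lines may be formatted differently.

```python
import re
from collections import deque

# Log levels are matched as written; "cleaner" is matched case-insensitively.
LEVELS = re.compile(r"\b(WARNING|SEVERE)\b")
CLEANER = re.compile(r"cleaner", re.IGNORECASE)


def interesting_lines(lines, tail=None):
    """Return lines that carry a WARNING/SEVERE level or mention 'cleaner'.

    If `tail` is given, only the last `tail` lines are examined, which
    approximates "near the end of the log".
    """
    recent = deque(lines, maxlen=tail)  # maxlen=None keeps every line
    return [ln for ln in recent if LEVELS.search(ln) or CLEANER.search(ln)]


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        "2012-12-01 INFO node started",
        "2012-12-02 WARNING disk usage high",
        "2012-12-03 INFO Cleaner thread idle",
        "2012-12-04 SEVERE environment failure",
    ]
    for line in interesting_lines(sample):
        print(line)
```

For a real log file, the lines could be read with `open(path).read().splitlines()` and passed in with a `tail` value of a few thousand.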

          --mark
          • 2. Re: Obsolete jdb not being cleaned up
            dimo
            Hi Mark,

            We updated the cache and heap sizes to more sane values on all 9 nodes when I posted the message to Linda and have not changed them since. However, the problem occurred on only 2 nodes.

            1) Is that safe to execute in a running production system? Does it create high load?
            2), 3) - will gather the data and reply again.

            Thank you,
            Dimo
            • 3. Re: Obsolete jdb not being cleaned up
              greybird
              1) Is that safe to execute in a running production system? Does it create high load?
              It does add some load, so if you're concerned about that then you may want to make a copy of the logs on a non-production system and run the utility there.
              --mark
              • 4. Re: Obsolete jdb not being cleaned up
                dimo
                Hi,

                The results from the call are:
                cut
                [user@host ~]$ java -jar /ora/kv-1.2.123/lib/je.jar DbSpace -h /ora/NODE06/STORE/STORE/sn6/rg2-rn3/env
                com.sleepycat.je.EnvironmentFailureException: (JE 5.0.36) /ora/NODE06/STORE/STORE/sn6/rg2-rn3/env last LSN=0x18/0x52af862 LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Environment is invalid and must be closed.
                cut
                That does not sound very good.

                How can this happen and how can we fix it?

                Thank you
                Dimo
                • 5. Re: Obsolete jdb not being cleaned up
                  greybird
                  Please post the complete stack trace.
                  --mark
                  • 6. Re: Obsolete jdb not being cleaned up
                    greybird
                    I'm sorry, I think I gave you the wrong command to run. DbSpace is probably not able to find a key comparator class that is in the kvstore.jar.

                    The correct command is:

                    java -cp <KV_HOME>/lib/kvstore.jar com.sleepycat.je.util.DbSpace -h <JE_HOME>

                    kvstore.jar refers to the other jars in its directory, so this will include the je.jar in the classpath.

                    --mark
                    • 7. Re: Obsolete jdb not being cleaned up
                      dimo
                      Hi Mark,

                      this fails too, with the following message: "Unknown command: DbSpace"

                      Cheers,
                      Dimo
                      • 8. Re: Obsolete jdb not being cleaned up
                        greybird
                        I apologize. I have corrected the command syntax in the message above.
                        --mark
                        • 9. Re: Obsolete jdb not being cleaned up
                          dimo
                          Hi Mark,
                          great, that worked. The output is:

                          File      Size (KB)  % Used
                          --------  ---------  ------
                          00000001    1048575       8
                          00000002    1048575      26
                          00000003    1048575      36
                          00000004    1048575      41
                          00000005    1048575      41
                          00000006    1048575      36
                          00000007    1048575      36
                          00000008    1048575      35
                          00000009    1048575      42
                          0000000a    1048574      35
                          0000000b    1048575      41
                          0000000c    1048575      37
                          0000000d    1048575      39
                          0000000e    1048575      41
                          0000000f    1048575      51
                          00000010    1048575      48
                          00000011    1048575      49
                          00000012    1048575      54
                          00000013    1048575      55
                          00000014    1048575      59
                          00000015    1048575      50
                          00000016    1048575      54
                          00000017    1048575      55
                          00000018    1048575      51
                          00000019    1048575      55
                          0000001a    1048575      56
                          0000001b    1048575      69
                          0000001c    1048575      66
                          0000001d    1048575      66
                          0000001e    1048575      66
                          0000001f    1048575      67
                          00000020    1048575      94
                          00000021     450868      93
                          TOTALS     34005291      49
                          (average uncounted LN size, corrected: 171.35216 estimated: 169.26071)

                          The output from the "healthy" master of the same replication group:
                          File      Size (KB)  % Used
                          --------  ---------  ------
                          0000001b    1048575      31
                          0000001c    1048575      23
                          0000001d    1048575      25
                          0000001e    1048575      25
                          0000001f    1048575      33
                          00000020    1048575      81
                          00000021     578968      93
                          TOTALS      6870422      41
                          (average uncounted LN size, corrected: 441.97516 estimated: 441.9595)
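
A rough way to compare listings like the two above is to total the file sizes and weight each file by its reported utilization. The Python sketch below does that for DbSpace-style text. The function names `parse_dbspace` and `summarize` are made up for this illustration, and the parsing assumes the three-column layout shown in this thread rather than any documented output format.

```python
def parse_dbspace(text):
    """Parse DbSpace-style rows into (file_name, size_kb, pct_used) tuples.

    Header, separator, TOTALS and summary lines are skipped because they
    do not consist of a hex file name followed by two integers.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        name, size, pct = parts
        try:
            int(name, 16)  # .jdb file numbers are hexadecimal
            rows.append((name, int(size), int(pct)))
        except ValueError:
            continue
    return rows


def summarize(rows):
    """Return (total_kb, used_kb), with used_kb weighted by each file's % used."""
    total_kb = sum(size for _, size, _ in rows)
    used_kb = sum(size * pct for _, size, pct in rows) // 100
    return total_kb, used_kb


if __name__ == "__main__":
    # The master's listing as posted in this thread.
    master = """
    File Size (KB) % Used
    -------- --------- ------
    0000001b 1048575 31
    0000001c 1048575 23
    0000001d 1048575 25
    0000001e 1048575 25
    0000001f 1048575 33
    00000020 1048575 81
    00000021 578968 93
    TOTALS 6870422 41
    """
    total_kb, used_kb = summarize(parse_dbspace(master))
    print("total KB:", total_kb, "estimated used KB:", used_kb)
```

Because the per-file sizes are reported in whole KB, the sum of the rows can differ slightly from the TOTALS line.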

                          What does the output tell us?

                          Cheers,
                          Dimo

                          Edited by: dimo on May 2, 2013 4:01 PM
                          • 10. Re: Obsolete jdb not being cleaned up
                            greybird
                            Dimo,

                            Thanks for the output. I can tell from the output (in particular, the difference in the "average uncounted LN size") that this is a known problem with NoSQL DB 1.2. There are two things that should be done about this.

                            1) The simplest and fastest way to correct the replica node is to restore it from the master node. We will send you instructions for doing this later today.

                            2) To prevent this problem from happening in the future, you should upgrade to NoSQL DB 2.0 (latest release) as soon as possible. Several fixes are included in 2.0 that help to avoid this problem.

                            In addition, if you have a commercial license for NoSQL DB you should contact us through the official support channel.

                            I apologize for the bug. One of the reasons we didn't catch this earlier is that it happens primarily with applications that do little or no writing, once the data set has been created. Most of our testing has been oriented around applications that perform continuous writing.

                            --mark
                            • 11. Re: Obsolete jdb not being cleaned up
                              greybird
                              1) The simplest and fastest way to correct the replica node is to restore it from the master node. We will send you instructions for doing this later today.
                              Here are directions for refreshing the data storage files (.jdb files) on a target node. NoSQL DB will automatically refresh the storage files from another node, after we manually stop the target node, delete its storage files, and finally restart it, as described below. Thanks to Linda Lee for these directions.

                              First, be sure to make a backup.

                              Suppose you want to remove the storage files from rg1-rn3 and make it refresh its files from rg1-rn1. First, check where the storage files for the target replication node are located, using the show topology command in the Admin CLI. Start the Admin CLI this way:
                                  java -jar <KV_HOME>/lib/kvstore.jar runadmin -host <host> -port <port>
                              Find the directory containing the target Replication Node's files.
                                  kv-> show topology -verbose
                                  store=mystore  numPartitions=100 sequence=108
                                    dc=[dc1] name=MyDC repFactor=3
                              
                                    sn=[sn1]  dc=dc1 localhost:13100 capacity=1 RUNNING
                                      [rg1-rn1] RUNNING  c:/linda/work/smoke/KVRT1/dirB
                                                   single-op avg latency=0.0 ms   multi-op avg latency=0.67391676 ms
                                    sn=[sn2]  dc=dc1 localhost:13200 capacity=1 RUNNING
                                      [rg1-rn2] RUNNING  c:/linda/work/smoke/KVRT2/dirA
                                                No performance info available
                                    sn=[sn3]  dc=dc1 localhost:13300 capacity=1 RUNNING
                                      [rg1-rn3] RUNNING  c:/linda/work/smoke/KVRT3/dirA
                                                   single-op avg latency=0.0 ms   multi-op avg latency=0.53694165 ms
                              
                                    shard=[rg1] num partitions=100
                                      [rg1-rn1] sn=sn1 haPort=localhost:13111
                                      [rg1-rn2] sn=sn2 haPort=localhost:13210
                                      [rg1-rn3] sn=sn3 haPort=localhost:13310
                                      partitions=1-100
                              In this example, rg1-rn3's storage is located in
                                  c:/linda/work/smoke/KVRT3/dirA
                              Stop the target service using the stop-service command
                                  kv-> plan stop-service -service rg1-rn3 -wait
                              In another command shell, remove the files for the target Replication Node
                                  rm c:/linda/work/smoke/KVRT3/dirA/rg1-rn3/env/*.jdb
                              In the Admin CLI, restart the service
                                  kv-> plan start-service -service rg1-rn3 -wait
                              The service will restart and will populate its missing files from one of the other two nodes in the shard. You can use the "verify" or the "show topology" command to check on the status of the store.
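
The file-removal step above, combined with the advice to make a backup first, could be sketched in Python as follows. This is a sketch under stated assumptions, not part of NoSQL DB: the function name `backup_and_remove_jdb` and the backup location are hypothetical, and it assumes the target Replication Node has already been stopped.

```python
import shutil
from pathlib import Path


def backup_and_remove_jdb(env_dir, backup_dir):
    """Copy every .jdb file in env_dir to backup_dir, then delete the originals.

    Only run this while the Replication Node that owns env_dir is stopped.
    Returns the sorted names of the files that were moved aside.
    """
    env = Path(env_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    jdb_files = sorted(env.glob("*.jdb"))
    # Copy everything first, so a failed copy leaves the environment untouched.
    for f in jdb_files:
        shutil.copy2(f, backup / f.name)
    # Only after all copies succeeded are the originals removed.
    for f in jdb_files:
        f.unlink()
    return [f.name for f in jdb_files]
```

With the example topology above, the call would look like `backup_and_remove_jdb("c:/linda/work/smoke/KVRT3/dirA/rg1-rn3/env", backup_dir)`, where `backup_dir` is any directory with enough free space.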

                              --mark
                              • 12. Re: Obsolete jdb not being cleaned up
                                966361
                                Hi,
                                We tried the fix described above; unfortunately, it failed with:

                                kv-> plan stop-service -service rg1-rn3 -wait
                                Unknown plan subcommand stop-service.


                                Shouldn't it go something like

                                kv-> plan -execute stop-repnodes ... ?

                                Best Regards
                                Christian
                                • 13. Re: Obsolete jdb not being cleaned up
                                  966361
                                  Hi,

                                  plan -execute stop-repnodes 1,3 worked in our test environment, where we had only one .jdb file.
                                  The .jdb file was deleted, and after
                                  plan -execute start-repnodes 1,3 it was rebuilt from the master node.

                                  Unfortunately, the .jdb file on the master node is considerably larger than the replicated .jdb file on the rg1-rn3 node.

                                  Best Regards
                                  Christian
                                  • 14. Re: Obsolete jdb not being cleaned up
                                    greybird
                                    It sounds like you executed the procedure designed to solve a problem on a replica, when you didn't have the problem in the first place. What motivated you to execute the procedure? If you only have one .jdb file, log cleaning cannot take place. Log cleaning (deletion of .jdb files) can take place only after several .jdb files are present.

                                    --mark