My first guess is that the problem is the one Linda Lee mentioned to you earlier:
Re: Replicas fill hard discs and cause a total failure
We have a fix for this in an upcoming 2.0-based release, and you may want to start planning to upgrade to 2.0.
However, before coming to any real conclusions I'd like to get more information.
1) Using the je.jar file in the lib directory of the release package, please run the following command on the node that is experiencing the problem. The <JE_HOME> directory is the one containing the .jdb files. Please post the output.
java -jar je.jar DbSpace -h <JE_HOME>
2) Even though there are no new messages in the logs since December, it may be that the problem started at around that time. Please post any WARNING or SEVERE messages, plus any messages containing the word "cleaner" (case insensitive), that are near the end of the log.
3) Since no .jdb files have been modified, has your application been performing only reads (no writes) since February? How much writing (inserts, deletes, updates) does it do in general, and how much writing do you estimate has occurred since December?
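For item 2, something like the following pulls the relevant messages out of the tail of the log. The log file name je.info.0 is a placeholder; substitute the actual log file for your node.

```shell
# Extract WARNING/SEVERE messages, plus any line mentioning "cleaner"
# (case-insensitive), from the last part of the node's log.
# "je.info.0" is a hypothetical log file name; adjust to your installation.
tail -n 2000 je.info.0 | grep -iE 'WARNING|SEVERE|cleaner'
```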
We did update the cache and heap sizes to more sane values on all 9 nodes when I posted the message to Linda, and did not change them afterwards. The problem, however, occurred on only 2 nodes.
1) Is that safe to execute in a running production system? Does it create high load?
2), 3) - will gather the data and reply again.
The results from the call are:
[user@host ~]$ java -jar /ora/kv-1.2.123/lib/je.jar DbSpace -h /ora/NODE06/STORE/STORE/sn6/rg2-rn3/env
com.sleepycat.je.EnvironmentFailureException: (JE 5.0.36) /ora/NODE06/STORE/STORE/sn6/rg2-rn3/env last LSN=0x18/0x52af862 LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Environment is invalid and must be closed.
That does not sound very good.
How can this happen and how can we fix it?
I'm sorry, I think I gave you the wrong command to run. DbSpace is probably not able to find a key comparator class that is in the kvstore.jar.
The correct command is:
java -cp <KV_HOME>/lib/kvstore.jar com.sleepycat.je.util.DbSpace -h <JE_HOME>
kvstore.jar refers to the other jars in its directory, so this will include the je.jar in the classpath.
Great, that worked. The output is:
File Size (KB) % Used
-------- --------- ------
00000001 1048575 8
00000002 1048575 26
00000003 1048575 36
00000004 1048575 41
00000005 1048575 41
00000006 1048575 36
00000007 1048575 36
00000008 1048575 35
00000009 1048575 42
0000000a 1048574 35
0000000b 1048575 41
0000000c 1048575 37
0000000d 1048575 39
0000000e 1048575 41
0000000f 1048575 51
00000010 1048575 48
00000011 1048575 49
00000012 1048575 54
00000013 1048575 55
00000014 1048575 59
00000015 1048575 50
00000016 1048575 54
00000017 1048575 55
00000018 1048575 51
00000019 1048575 55
0000001a 1048575 56
0000001b 1048575 69
0000001c 1048575 66
0000001d 1048575 66
0000001e 1048575 66
0000001f 1048575 67
00000020 1048575 94
00000021 450868 93
TOTALS 34005291 49
(average uncounted LN size, corrected: 171.35216 estimated: 169.26071)
The output from the "healthy" master of the same replication group:
File Size (KB) % Used
-------- --------- ------
0000001b 1048575 31
0000001c 1048575 23
0000001d 1048575 25
0000001e 1048575 25
0000001f 1048575 33
00000020 1048575 81
00000021 578968 93
TOTALS 6870422 41
(average uncounted LN size, corrected: 441.97516 estimated: 441.9595)
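One rough way to read these tables: for each file, (100 - "% Used") percent of its size is obsolete data that the cleaner could eventually reclaim. A small awk sketch sums this up; dbspace.txt is a hypothetical file holding the pasted DbSpace output.

```shell
# Sum the reclaimable space implied by DbSpace output saved in dbspace.txt
# (hypothetical filename). Data lines have the form: <hex file> <size KB> <% used>;
# the header, separator, TOTALS, and trailing average lines are filtered out.
awk '$1 ~ /^[0-9a-f]+$/ && NF == 3 {
         waste += $2 * (100 - $3) / 100
     }
     END { printf "reclaimable: %.0f KB\n", waste }' dbspace.txt
```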
What does the output tell us?
Edited by: dimo on May 2, 2013 4:01 PM
Thanks for the output. I can tell from the output (in particular, the difference in the "average uncounted LN size") that this is a known problem with NoSQL DB 1.2. There are two things that should be done about this.
1) The simplest and fastest way to correct the replica node is to restore it from the master node. We will send you instructions for doing this later today.
2) To prevent this problem from happening in the future, you should upgrade to NoSQL DB 2.0 (latest release) as soon as possible. Several fixes are included in 2.0 that help to avoid this problem.
In addition, if you have a commercial license for NoSQL DB you should contact us through the official support channel.
I apologize for the bug. One of the reasons we didn't catch this earlier is that it happens primarily with applications that do little or no writing, once the data set has been created. Most of our testing has been oriented around applications that perform continuous writing.
1) The simplest and fastest way to correct the replica node is to restore it from the master node. We will send you instructions for doing this later today.

Here are directions for refreshing the data storage files (.jdb files) on a target node. NoSQL DB will automatically refresh the storage files from another node after we manually stop the target node, delete its storage files, and finally restart it, as described below. Thanks to Linda Lee for these directions.
First, be sure to make a backup.
Suppose you want to remove the storage files from rg1-rn3 and make it refresh its files from rg1-rn1. First, check where the storage files for the target Replication Node are located, using the show topology command in the Admin CLI. Start the Admin CLI this way:

java -jar KVHOME/lib/kvstore.jar runadmin -host <host> -port <port>

Find the directory containing the target Replication Node's files:

kv-> show topology -verbose
store=mystore numPartitions=100 sequence=108
dc=[dc1] name=MyDC repFactor=3
sn=[sn1] dc=dc1 localhost:13100 capacity=1 RUNNING
  [rg1-rn1] RUNNING c:/linda/work/smoke/KVRT1/dirB
    single-op avg latency=0.0 ms  multi-op avg latency=0.67391676 ms
sn=[sn2] dc=dc1 localhost:13200 capacity=1 RUNNING
  [rg1-rn2] RUNNING c:/linda/work/smoke/KVRT2/dirA
    No performance info available
sn=[sn3] dc=dc1 localhost:13300 capacity=1 RUNNING
  [rg1-rn3] RUNNING c:/linda/work/smoke/KVRT3/dirA
    single-op avg latency=0.0 ms  multi-op avg latency=0.53694165 ms
shard=[rg1] num partitions=100
  [rg1-rn1] sn=sn1 haPort=localhost:13111
  [rg1-rn2] sn=sn2 haPort=localhost:13210
  [rg1-rn3] sn=sn3 haPort=localhost:13310
partitions=1-100

In this example, rg1-rn3's storage is located in c:/linda/work/smoke/KVRT3/dirA.

Stop the target service using the stop-service command:

kv-> plan stop-service -service rg1-rn3 -wait

In another command shell, remove the files for the target Replication Node.

In the Admin CLI, restart the service:

kv-> plan start-service -service rg1-rn3 -wait

The service will restart and will populate its missing files from one of the other two nodes in the shard. You can use the "verify" or the "show topology" command to check on the status of the store.
plan -execute stop-repnodes 1,3 worked in our test environment; we only had one .jdb file there.
The .jdb file was deleted, and after
plan -execute start-repnodes 1,3
it was rebuilt from the master node.
Unfortunately, the .jdb file on the master node is considerably larger than the replicated .jdb file on the rg1-rn3 node.
It sounds like you executed the procedure designed to solve a problem on a replica, when you didn't have the problem in the first place. What motivated you to execute the procedure? If you only have one .jdb file, log cleaning cannot take place. Log cleaning (deletion of .jdb files) can take place only after several .jdb files are present.