I suspect the problem is something we know about and have been actively working on. It comes up when a store is run with significantly undersized caches and the application does a large number of updates, then ceases all write activity. In the cases we've looked at, the store's underlying log cleaning falls behind during the heavy application load. It would catch up, except that a bug in some metadata maintenance is gating the log cleanup.
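To make the mechanism concrete: the store manages Berkeley DB Java Edition internally, so this is not something you would run against a live store, but a minimal standalone JE sketch (with a made-up environment path and a deliberately small cache) shows how the cleaner backlog can be observed:

    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.EnvironmentStats;
    import com.sleepycat.je.StatsConfig;
    import java.io.File;

    public class CleanerBacklogSketch {
        public static void main(String[] args) {
            File dir = new File("/tmp/je-env");   // illustrative path
            dir.mkdirs();

            EnvironmentConfig config = new EnvironmentConfig();
            config.setAllowCreate(true);
            // Deliberately small cache, mimicking an undersized configuration.
            config.setCacheSize(16 * 1024 * 1024);

            Environment env = new Environment(dir, config);
            EnvironmentStats stats = env.getStats(new StatsConfig());
            // Number of log files awaiting cleaning; a value that keeps
            // growing under write load means the cleaner is falling behind.
            System.out.println("cleaner backlog: " + stats.getCleanerBacklog());
            env.close();
        }
    }

A backlog that never drains after writes stop is the signature of the metadata bug described above.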
Your case sounds like that, except that you see asymmetrical behavior on the part of the master node. Does the application load consist only of updates, or of mixed updates and reads? That may have some bearing on the asymmetry.
Our R2 pre-release has some improvements for this problem, and we are actively working on a complete solution. But there are really two issues at hand. What you've seen is poor handling of the case where log cleaning falls behind, and we will be fixing that because it can cause the sort of catastrophic out-of-disk failure you saw. More fundamentally, though, the store may also not be optimally configured for your load; fixing the log cleaning issue might still leave you with suboptimal performance.
We've got some documentation in the Admin Guide on how to come up with starting-point configurations that best support the application load and the hardware. If you post more information about your application's key and data sizes and your hardware, we can comment on what might work. For example, it sounds like your application might have large keys and small data values; we find that smaller keys are generally more efficient in the NoSQL caches.
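As a concrete (and purely illustrative) example of the key-size point, here is a small sketch using the Java client API; the store name, helper host, and key components are hypothetical:

    import java.util.Arrays;
    import oracle.kv.KVStore;
    import oracle.kv.KVStoreConfig;
    import oracle.kv.KVStoreFactory;
    import oracle.kv.Key;
    import oracle.kv.Value;

    public class KeySizeSketch {
        public static void main(String[] args) {
            // Hypothetical store name and helper host:port.
            KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("mystore", "node01:5000"));

            // Verbose key: every byte of every key component is held in the
            // replication node caches, crowding out other entries.
            Key verbose = Key.createKey(Arrays.asList(
                "customerRecords", "europeanRegion", "customer0000123456"));

            // Compact equivalent: shorter components let far more entries
            // fit in the same cache.
            Key compact = Key.createKey(Arrays.asList("cust", "eu", "123456"));

            store.put(compact, Value.createValue("payload".getBytes()));
            store.close();
        }
    }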
We'd also be interested in more details about your application so that we can use it as a test case for the log cleaning fix; in our current test cases, the nodes of the cluster all behave symmetrically, unlike what you experienced. If sharing that is possible, please contact me at linda dot q dot lee at oracle dot com.
The load consisted of sequences of bulk inserts of large amounts of data, reads, and then bulk deletes. I am not sure when it happened for the first time, but I was able to reproduce the behaviour after a high-load bulk insert into a newly set up KV store. The RN processes were under heavy load even hours after the actual inserts had finished (my guess is a lot of Java garbage collection).
Thanks to your hint, we increased the heap sizes and the BDB JE cache sizes of the replication nodes and were able to run the same load without any problems, and with improved response times, throughput, and hard disk footprint.
I guess the problem was partly the somewhat misleading Admin Guide, which led us to believe that we could set the heap size and cache sizes using the set policy command. Instead, we had to use plan -execute change-all-repnode-params "cacheSize=.." and plan -execute change-all-repnode-params "javaMiscParams=..".
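For anyone hitting the same issue, the admin CLI session looked roughly like this (host, port, and sizes are placeholders, not our actual values):

    java -jar lib/kvstore.jar runadmin -port 5000 -host node01
    kv-> plan -execute change-all-repnode-params "javaMiscParams=-Xms4g -Xmx4g"
    kv-> plan -execute change-all-repnode-params "cacheSize=3221225472"

Note that cacheSize is the JE cache in bytes and lives inside the Java heap, so it has to stay comfortably below the -Xmx value.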
Thanks again for the help.