Following the eviction code in CursorImpl, I see that under the following conditions the cursor may not evict the LN right away:
1. Its BIN is dirty and will be migrated soon (i.e., lazy migration).
2. The file that the LN is in will be cleaned shortly.
3. Some experimental clustering, which is off by default.
Don't you think we should base eviction on 1 and 2 only when the LN is dirty? Even when logging BINs, JE logs deltas. So I think that as long as the LN is not dirty, we should let it go here.
Please let me know what you think.
For now, I just need the confirmation on the fetchMiss/fetch stat interpretation from you. I will try to play with lazy migration to see if I can get more LNs off the JVM heap.
> 1. Its BIN is dirty and will be migrated soon (i.e., lazy migration).
> 2. The file that the LN is in will be cleaned shortly.
> 3. Some experimental clustering, which is off by default.
None of these will happen except 1, and only if you have lazy migration on.
> For now, I just need the confirmation on the fetchMiss/fetch stat interpretation from you. I will try to play with lazy migration to see if I can get more LNs off the JVM heap.

I claim that if you turn off lazy migration and you turn on EVICT_LN, no LNs will be in cache (except for an active operation). The fetch stat will not help to see whether this is happening (as I explained above), so you can either trust me or do more in-depth debugging to get more info.
If you think that LNs are not being evicted when they should be, the simplest thing is to write a very small test to confirm this. For each get(), you should see one cache miss. (Ignore the fetch stat, just look at the misses.) The test you described earlier is seeing one miss per get(), which confirms that EVICT_LN is working.
I'm sorry if I confused the issue with what I said earlier about the use of nLNsFetch and nLNsFetchMiss. The problem is that nLNsFetch is not the same as the number of records read (the number of get() operations). So using these two stats to "gauge cache hit/miss ratios" (as the javadoc says!) isn't accurate. What is lacking in the stats is a set of counters for the number of read and write operations.
I've filed ticket [#21554], so we can think about the need for more stats later on, or at least remember to correct the javadoc.
That's okay. For verification purposes, I think we will do a 'jmap -histo' and see how many LNs are in the JVM heap. It's not that I don't trust you :) (of course I do, you are awesome). We are seeing heavy CMS activity, which usually indicates fragmentation. I assume index nodes are pretty much all the same size, since I expect most of them to contain > 4 children, which means they will have the default representation in memory. Since the CMS activity is real and could only be caused by LN nodes being in the heap, I am super anxious to get this working.
We currently have 3 cleaners with lazy migration turned on. I am going to try reducing the cleaners to 1 (it's on SSD; it should be able to keep up) and/or turning lazy migration off. This should drive away the fragmentation (if not, I am not sure what else to do). Will keep this thread updated.
I think it's worth calculating the LN miss rate using (nFetchMiss / nOps) where nOps is the number of JE operations you perform and is computed by your app. That way, if it is not 100%, then you can add more debugging to find out what ops are not evicting the LN.
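The check above can be sketched as a small helper. This is illustrative only: the stat numbers below are placeholders, and the app is assumed to count nOps itself (it is not a JE stat).

```java
// Sketch of the suggested LN miss-rate check. nLNsFetchMiss comes from
// EnvironmentStats; nOps must be counted by the application, since JE
// (in this version) has no counter for the number of operations.
public class LnMissRate {

    // With EVICT_LN working, every get() should miss the LN cache,
    // so this should come out close to 1.0 (100%).
    static double missRate(long nLNsFetchMiss, long nOps) {
        if (nOps == 0) {
            return 0.0; // avoid division by zero before any ops have run
        }
        return (double) nLNsFetchMiss / nOps;
    }

    public static void main(String[] args) {
        long nOps = 1_000_000;        // counted by the application
        long nLNsFetchMiss = 998_500; // placeholder for the JE stat value
        double rate = missRate(nLNsFetchMiss, nOps);
        System.out.println(rate);
        // A rate well below 1.0 means some operations are leaving LNs
        // resident, and that is where to add more debugging.
    }
}
```

If the rate is near 1.0, EVICT_LN is doing its job; if not, the shortfall tells you roughly what fraction of operations are not evicting the LN.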
It is also possible the CMS activity is not due to LNs, but to mutations in BIN representations. We don't see this in our tests, but that doesn't mean it can't happen with a different access pattern.
Looking at INTargetRep is a good tactic. Assuming LNs are not kept in cache, this should mostly be None, with some Sparse that are in the middle of an operation. This is what we see in JE 5 anyway. I assume you've seen the stats for this.
When you say that the rep should be Default because they contain more than 4 children, that shouldn't be the case with EVICT_LN. It's the number of resident children that determines the rep, and LNs should not be resident. I'm speaking of the BINs, not the upper INs. The BINs dominate the cache.
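My reading of the above, as a toy model: the rep is chosen from the number of *resident* children, not the number of slots. The thresholds here (0 resident -> None, up to 4 -> Sparse, more -> Default) are my interpretation of this thread, not JE's actual code.

```java
// Toy model of BIN child-target representation selection, per the
// discussion above. Not JE source; thresholds are assumptions.
public class TargetRepModel {

    enum Rep { NONE, SPARSE, DEFAULT }

    static Rep chooseRep(int residentChildren) {
        if (residentChildren == 0) {
            return Rep.NONE;    // JE 5: an empty BIN reverts to None (0 bytes)
        } else if (residentChildren <= 4) {
            return Rep.SPARSE;  // a few children resident mid-operation
        }
        return Rep.DEFAULT;     // full array, one slot per child
    }

    public static void main(String[] args) {
        // With EVICT_LN on, a BIN with 100 slots but no resident LNs
        // still uses the cheap reps, never Default.
        System.out.println(chooseRep(0));   // NONE
        System.out.println(chooseRep(2));   // SPARSE (operation in flight)
        System.out.println(chooseRep(100)); // DEFAULT (LNs left resident)
    }
}
```

This is why seeing mostly Default reps would itself be evidence that LNs are staying resident when they shouldn't.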
You should not trust me completely, seriously, since I often make mistakes and in this case it's possible I'm thinking of JE 5 behavior. I haven't done JE 4 performance testing since we started on JE 5, at least two years ago. It's possible that we improved behavior in this area in JE 5, even though EVICT_LN was present in JE 4.
You were right. Turning off lazy cleaner migration does reduce the number of LNs in the cache by ~75%. But we are still seeing high fragmentation, even when the cache is dominated by index nodes. We have assumed so far that the keys are uniform in size. Is that true? (i.e., wouldn't you have to make them some uniform size so you can compare them?)
Pasted a histogram of the heap below
num #instances #bytes class name
1: 98291348 14291470352 [B
2: 1039956 2074705520 [J
3: 1005044 1023373616 [[B
4: 16374 184680608 [I
5: 1043765 147572144 [Lcom.sleepycat.je.tree.Node;
6: 1349020 69703856 [C
7: 393821 53559656 com.sleepycat.je.tree.BIN
8: 324342 46705248 com.sleepycat.je.tree.DBIN
9: 348798 44646144 com.sleepycat.je.tree.DIN
10: 1073152 42926080 java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync
11: 1073103 42924120 com.sleepycat.je.latch.SharedLatch
12: 1263927 40445664 java.lang.String
13: 1063045 34017440 java.util.concurrent.ConcurrentHashMap$HashEntry
14: 985414 31533248 com.sleepycat.je.tree.INTargetRep$Sparse
15: 997399 25219952 [S
16: 302535 25094208 [Ljava.util.HashMap$Entry;
17: 1004026 24096624 com.sleepycat.je.tree.INKeyRep$Default
18: 718163 17235912 com.sleepycat.je.utilint.TinyHashSet
19: 1073160 17170560 java.util.concurrent.locks.ReentrantReadWriteLock$Sync$ThreadLocalHoldCounter
20: 1073152 17170432 java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock
21: 1073152 17170432 java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock
22: 298177 14312496 java.util.HashMap
23: 421027 13472864 com.sleepycat.je.tree.LN
24: 4164 11680680 [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;
25: 350754 11224128 com.sleepycat.je.tree.ChildReference
26: 81357 11137784 <constMethodKlass>
27: 81357 11074696 <methodKlass>
28: 257973 8255136 java.util.HashMap$Entry
29: 6736 7647456 <constantPoolKlass>
30: 101940 5902336 <symbolKlass>
31: 6736 5515992 <instanceKlassKlass>
32: 150468 4814976 java.lang.ThreadLocal$ThreadLocalMap$Entry
33: 5815 4702328 <constantPoolCacheKlass>
No, keys are variable length and we can see that INKeyRep$Default is used, meaning one byte array per key. If the key size (without common prefix) is <= 16 bytes, we use a single byte array per BIN (INKeyRep$MaxKeySize). In JE 5 we added a config option for this constant (if 16 isn't best for your app) but that doesn't help you on JE 4.1. If you haven't configured key prefixing, you should, to have a chance of using INKeyRep$MaxKeySize.
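To illustrate the mechanics described above, here is a standalone helper (not JE code; the key values are made up) that computes the common prefix of a set of keys and checks whether the remaining suffixes fit the 16-byte budget that the compact single-array rep requires:

```java
import java.nio.charset.StandardCharsets;

// Illustration of why key prefixing enables INKeyRep$MaxKeySize: after
// stripping the common prefix, every key suffix must fit in a small
// fixed budget (16 bytes by default) for the single-array rep to apply.
// This is a sketch of the idea, not JE's implementation.
public class PrefixCheck {

    // Length of the byte prefix shared by all keys.
    static int commonPrefixLen(byte[][] keys) {
        int prefix = keys[0].length;
        for (byte[] k : keys) {
            int i = 0;
            int max = Math.min(prefix, k.length);
            while (i < max && k[i] == keys[0][i]) {
                i++;
            }
            prefix = i;
        }
        return prefix;
    }

    // True if every key, minus the common prefix, fits in maxKeyBytes.
    static boolean compactRepPossible(byte[][] keys, int maxKeyBytes) {
        int prefix = commonPrefixLen(keys);
        for (byte[] k : keys) {
            if (k.length - prefix > maxKeyBytes) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical store keys sharing a long common prefix.
        byte[][] keys = {
            "store1/user/0001".getBytes(StandardCharsets.UTF_8),
            "store1/user/0002".getBytes(StandardCharsets.UTF_8),
            "store1/user/0147".getBytes(StandardCharsets.UTF_8),
        };
        System.out.println(commonPrefixLen(keys));        // 13 ("store1/user/0")
        System.out.println(compactRepPossible(keys, 16)); // true: 3-byte suffixes
    }
}
```

With no prefixing configured, the whole 16-byte key counts against the budget; with prefixing, only the 3-byte suffixes do, so the compact rep becomes reachable.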
I see that all your BINs are using INTargetRep$Sparse. We added an optimization in JE 5 so that the INTargetRep$Sparse representation reverts to INTargetRep$None (0 bytes) when a BIN is empty, which consumes less cache. But the INTargetRep$Sparse object should not be allocated/freed often, so it is probably not your GC issue.
The main thing that strikes me is that you have as many DIN/DBIN nodes as BINs. This means (if I understand Voldemort) that your use of duplicates is very wasteful of cache memory, although I don't know if that's causing eviction which is then causing GC issues -- it certainly could be, since it triples the number of INs. It looks like many keys have a duplicate, meaning (I think) in Voldemort that these records have two versions. The worst case would be that there are many DBINs that contain only two keys each. I had understood that Voldemort would rarely create a duplicate (2nd version for a record), but what we're seeing in the heap dump is contrary to this, unless there is another explanation for the DIN/DBIN numbers (some other use of duplicates in your test?).
I'm on vacation until next week.
Thanks for confirming. Makes sense. We will think about dealing with this.
Regarding the duplicates, Voldemort could potentially have a lot of duplicates (~20%) with highly concurrent writes. And yes, if we do better on this front, that would be great. I definitely have this noted down.
But right now, the fragmentation is more worrying and hurting us as I type.
Have a good vacation! We can chat later.
Well, there is always a factor I tend not to mention in our conversations, so as not to confuse you: our multitenancy. In this particular case, there is ~600 GB of data shared across 20+ environments over a 10 GB shared BDB cache. I think the fragmentation is due to different key sizes across stores. As a short-term fix, I am going to try key prefixing, to see if it helps put out this fire for now.
I would need to scope out a long-term generic solution for this, since we are not moving away from multitenancy in the future (due to its obvious operational advantages). Hope that clarifies where I am coming from.