This discussion is archived
1 2 3 Previous Next 34 Replies Latest reply: Jul 16, 2012 4:13 PM by greybird RSS

Latency increase in upgrage to 4.1.17 from 4.0.92

vinothchandar Newbie
Currently Being Moderated
This has already been brought up with two other versions.
BerkeleyDB Je 4.0.90 to 4.1.7 update yields to increased put latency
Latency increase in je 4.0.117
and we cannot really move to BDB5 at the moment.

We rolled out BDB 4.1.17 and instantly, our put latencies went up the traffic has about 100 writes/sec, no reads. avg increase by 5ms and 95th by 50ms.
I see that we are doing a lot frequent young gen collections.

Any idea what could be causing this? Help/clarification really appreciated.

Thanks
Vinoth
  • 1. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    greybird Expert
    Currently Being Moderated
    I'm sorry, Vinoth, I really don't know. Doing some real performance analysis, which takes significant time, is the only way I would know of finding out.

    --mark                                                                                                                                                                                                                                                                                                                                       
  • 2. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    Okay no problems. I will try to debug as far as I can

    Thanks
    Vinoth
  • 3. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    greybird Expert
    Currently Being Moderated
    If you're seeing a difference in GC behavior, be aware of the changes to eviction in JE 4.1, which were quite significant. See item [#18626] in the change log. In general, looking through the change log could help narrow it down.

    To do performance analysis, the brute force method, which is the only method that works for me when the problem isn't obvious, is to dump all the JE stats periodically (be sure to use StatsConfig.setClear(true)), and also take thread dumps periodically. Put all the JE stats in a spreadsheet along with the app latency and throughput stats. Try to correlate the latency issues with changes in the stats and changes in what the thread dumps show.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
  • 4. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    Sure. let me try that. From the change logs, I think these are the only changes that could potentially affect the performance, between 4.0.92 to 4.1.17,

    4.1.6
    -- Introduction of a new evictor thread pool [18626] ".
    -- Optimize sync() of database
    -- Better in memory btree compression

    4.0.103
    -- Introduce new cache modes.

    Please let me know if something else comes to mind. Just a question, when the new cache modes (evict_ln, evict_bin) etc were introduced, was there a rewrite of the cache, in terms of data structures used etc? In other words, from your tests, did the introduction of the new cache modes/evictor pool etc worsen (even slightly) the performance of certain workloads? I know you are out of time. So, if you can't answer these from the top of your head, pls don't bother.

    Thanks
    Vinoth
  • 5. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    greybird Expert
    Currently Being Moderated
    Hi Vinoth,

    The new CacheModes did not change the structure of the cache and should not impact any behavior (unless they are used, of course).

    The new eviction mechanism does a much better job at keeping the cache size under the configured limit than before. I guess it is possible that, before (pre JE 4.1) the cache was going above the limit and your app was tolerant of that, and now it is not, but more work is being done. That's just a thought, not something we've seen in action.

    The idea with the new eviction mechanism is to try to offload more of the work to background threads, so that eviction has less of an impact on the app thread. But this is contrary to the symptom you're seeing (increased latency).

    One tidbit (although I don't think this is related to why you would see increased latency from JE 4.0 to 4.1) has to do with something called "critical eviction". Critical eviction is eviction that occurs in the app thread and therefore impacts latency. If you often see critical eviction in thread dumps (Evictor.doCriticalEviction on the stack), that's a clue as to what's happening. It means that eviction in your app thread could be the cause of the increased latency.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  • 6. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    The latency was due to some cleaning issue, with a growing cleaning backlog. Things seem okay now, although no big improvements overall.

    So, we have added extensive BDB monitoring for Voldemort servers. With this, what I see is that, even with evictln on (environmentConfig.setCacheMode(CacheMode.EVICT_LN);), there is a very low cache miss for the LN nodes. (LNFetchMiss/LNFetches = ~20). Am I not supposed to see a very high cache miss, close to 100% for LNs? Can you throw some light on this?



    Thanks
    Vinoth
  • 7. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    greybird Expert
    Currently Being Moderated
    Hi Vinoth,

    Glad to hear you resolved the latency issue!

    Yes, you should be seeing around 100% LNFetchMiss. But the more suspicious thing is that you're seeing only 20 LNFetches, which makes me wonder if you're somehow getting the stats at the wrong time. Are you expecting the fields in EnvironmentStats to change over time? They won't, you have to call getStats again to refresh them.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
  • 8. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    HI Mark, I meant 20% cache miss rate. sorry about the typo
  • 9. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    No we call getStats() every 5s. I can see the cache size go up and down. So, its definitely not a experiment issue.
    Anyways, good to confirm this from you (expected 100% LNFetchMiss). Let me dig more and get back to you.
  • 10. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    I have double checked and confirmed. The LNMissRatio is no way close to 100% (its around 50% or so). Can you think of any configuration setting which could cause LNs to be held in cache, even when do a environmentConfig.setCacheMode(CacheMode.EVICT_LN); ?

    This is very important to us, to deal with fragmentation issues.
  • 11. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    greybird Expert
    Currently Being Moderated
    Vinoth,

    I just realized something I should have thought of earlier. The fact that you say the miss rate is 50% made me think of this. If you use a JE cursor, LNs are not evicted until the cursor is moved to a different record or the cursor is closed. So if you do a read-modify-write -- which is two operations at the same cursor position -- JE will attempt to fetch the LN twice. The first fetch will be a miss and the second will not be.

    There are other cases where this would cause the stats to show less than 100% miss rate. For example, calling get() twice at the same cursor position.

    Is it possible that something like this is happening?

    I'm confident that EVICT_LN works, in the sense that no LNs are kept in cache (except those currently addressed by a cursor).

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 12. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    That's good info. The traffic is heavily read (100 ops/sec) and little writes (7 ops/sec), this gives a fetch miss rate of 30%.
    For reads, we simply do a cursor walk,

    for(OperationStatus status = cursor.getSearchKey(keyEntry, valueEntry, lockMode); status == OperationStatus.SUCCESS; status = cursor.getNextDup(keyEntry,
    valueEntry,
    lockMode))
    As a test, I brought up a server locally, did 10 gets, this seems to bump up LNFetches by 20 and LNFetchMisses by 10. So, I am not really sure.
    Would this somehow be affected by what you described?

    Just some confirmation questions.

    -- So, EVICT_LN will evict the leaf nodes, even when there is enough place in the cache? I ask this, since I know LNs are not tracked explicitly for eviction, but are evicted when their parent BIN is selected for eviction.
    -- We still have CLEANER_LAZY_MIGRATION on, (we ran into some latency issues as I reported before). So, will there always be some LNs in memory i.e those waiting to be migrated during eviction/checkpointing?

    We ideally don't want any LNs in memory.

    Thanks
    Vinoth
  • 13. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    greybird Expert
    Currently Being Moderated
    As a test, I brought up a server locally, did 10 gets, this seems to bump up LNFetches by 20 and LNFetchMisses by 10. So, I am not really sure.
    Would this somehow be affected by what you described?>

    I'm pretty sure this is just due to the way operations work and stats are collected. A single operation might call fetch() twice internally, just due to the way the algorithm works or the code is structured, even though only at most one true fetch (that could miss) will occur.

    Maybe a better check is that you see one miss for each app operation that you expect to do a fetch. In other words, ignore the fetch stat and only pay attention to the miss stat.
    -- So, EVICT_LN will evict the leaf nodes, even when there is enough place in the cache? I ask this, since I know LNs are not tracked explicitly for eviction, but are evicted when their parent BIN is selected for eviction.
    Yes. The cursor operation itself does the LN eviction.
    -- We still have CLEANER_LAZY_MIGRATION on, (we ran into some latency issues as I reported before). So, will there always be some LNs in memory i.e those waiting to be migrated during eviction/checkpointing?
    Yes. LNs waiting to be migrated will not be evicted (even by EVICT_LN), since this is contrary to the whole point of lazy migration.
    We ideally don't want any LNs in memory.
    Then you'll need to disable lazy migration.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  • 14. Re: Latency increase in upgrage to 4.1.17 from 4.0.92
    vinothchandar Newbie
    Currently Being Moderated
    Hi mark,

    By operation, do you mean a voldemort operation and a BDB operation? the loop I pasted is for each voldemort operation. Do you see that it might increment LNFetches twice? LNFetchMisses is incremented only once. But we graph, LNFetchMisses / LNFetches, which is 50% max, hence wondering why. Are you saying 50% for misses/fetches is effectively a 100% miss? (I am okay with it being 50%. But want to confirm)
1 2 3 Previous Next

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points