5 Replies. Latest reply: Dec 10, 2012 6:55 PM by vinothchandar

    Understanding the correlation between cache size and cleaner performance

    vinothchandar
      Hi,

      I am trying to size our caches using a capacity model that takes into account the workload we run, the database size/shape, etc. (This is JE 4.1.17, but I think it should apply in principle to BDB5 as well.)

      We provide enough memory to:
      A -- cache all the upper INs
      B -- hold a complete log file while cleaning (#cleaners=1)
      C -- hold BINs that might be brought in by the cleaner when it migrates LNs
      D -- hold dirty BINs from online writes

      (I turned off the checkpointer to make things simple).
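
A minimal Java sketch of the budget listed above, assuming the four components are simply additive and that B grows with the number of cleaner threads (the post itself only covers one cleaner). The class and parameter names are hypothetical and are not part of JE; every input would have to come from your own measurements.

```java
/** Hypothetical capacity model: cache budget as the sum of components A through D. */
final class CacheBudget {
    private CacheBudget() {}

    /**
     * @param upperInBytes    A: memory needed to cache all upper INs
     * @param logFileBytes    B: size of one log file held while cleaning
     * @param cleanerThreads  number of cleaner threads (1 in the post); scaling B
     *                        by this value is an assumption of this sketch
     * @param cleanerBinBytes C: BINs fetched by the cleaner while migrating LNs
     * @param dirtyBinBytes   D: dirty BINs produced by online writes
     * @return total cache budget in bytes
     */
    static long totalBytes(long upperInBytes, long logFileBytes, int cleanerThreads,
                           long cleanerBinBytes, long dirtyBinBytes) {
        if (cleanerThreads < 1) {
            throw new IllegalArgumentException("cleanerThreads must be >= 1");
        }
        // multiplyExact/addExact throw ArithmeticException on overflow
        // instead of silently wrapping around.
        long b = Math.multiplyExact(logFileBytes, (long) cleanerThreads);
        long staticPart = Math.addExact(Math.addExact(upperInBytes, b), cleanerBinBytes);
        return Math.addExact(staticPart, dirtyBinBytes);
    }
}
```

Keeping D as a separate argument mirrors the question that follows: A, B and C can be computed once per database, while D is the term that depends on the write rate.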

      I have the following questions.

      1. Given A, B, C are pretty much static (they don't depend on put rate), how much should I size D to be?

      2. When I reintroduce the checkpointer, it could write out the dirty BINs from D (currently our bytes.interval is 20MB), causing more garbage. Would that make it harder for the cleaners to catch up?

      3. All the background threads seem to be triggered per MB of writes to the log. Does this include writes coming from cleaners/eviction/checkpointing as well? I.e., are the cleaners/checkpointer triggered solely by online traffic, or does the cleaner's migration also count as a write?

      Some clarification would be great.
      Thanks
      Vinoth
        • 1. Re: Understanding the correlation between cache size and cleaner performance
          vinothchandar
          For 3, from the code, it seems that writes from the background threads also count toward the thresholds that trigger cleaning and checkpointing.
          • 2. Re: Understanding the correlation between cache size and cleaner performance
            greybird
            1. Given A, B, C are pretty much static (they don't depend on put rate), how much should I size D to be?
            Dirty BINs are written (made non-dirty) by checkpoints. So you'd want to estimate the number of BINs that are dirtied in between checkpoints.
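
One rough way to put a number on that estimate is sketched below in Java. It assumes a checkpoint runs every `checkpointBytesInterval` bytes of log, that each logged write dirties one BIN, and that in the worst case every write lands in a different BIN, capped by the total number of BINs. These are simplifying assumptions, and all names are hypothetical rather than JE API.

```java
/** Hypothetical upper-bound estimate of BINs dirtied between two checkpoints. */
final class DirtyBinEstimate {
    private DirtyBinEstimate() {}

    /** Worst-case count of distinct BINs dirtied in one checkpoint interval. */
    static long dirtyBinsUpperBound(long checkpointBytesInterval,
                                    long avgLoggedBytesPerWrite,
                                    long totalBins) {
        if (checkpointBytesInterval < 0 || avgLoggedBytesPerWrite <= 0 || totalBins < 0) {
            throw new IllegalArgumentException("invalid sizing input");
        }
        // Ceiling division: number of writes that fit into one interval.
        long writes = (checkpointBytesInterval + avgLoggedBytesPerWrite - 1)
                / avgLoggedBytesPerWrite;
        // A BIN can only be dirty once, so the tree size caps the estimate.
        return Math.min(writes, totalBins);
    }

    /** Memory for component D, given an assumed average in-memory BIN size. */
    static long dirtyBinBytes(long checkpointBytesInterval, long avgLoggedBytesPerWrite,
                              long totalBins, long avgBinBytes) {
        long bins = dirtyBinsUpperBound(checkpointBytesInterval,
                avgLoggedBytesPerWrite, totalBins);
        return Math.multiplyExact(bins, avgBinBytes);
    }
}
```

Because it treats every write as hitting a distinct BIN, the result is likely an overestimate for workloads with any key locality; measured statistics would be a better input where they are available.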
            2. When I reintroduce the checkpointer, it could write out the dirty BINs from D (currently our bytes.interval is 20MB), causing more garbage. Would that make it harder for the cleaners to catch up?
            I don't know what you're asking. Checkpoints are necessary to bound recovery time. Yes, they write information that needs to be cleaned later.
            3. All the background threads seem to be triggered per MB of writes to the log. Does this include writes coming from cleaners/eviction/checkpointing as well? I.e., are the cleaners/checkpointer triggered solely by online traffic, or does the cleaner's migration also count as a write?
            Right, all writes count.

            --mark
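
To illustrate the "all writes count" answer above, here is a toy Java counter in which a background task is triggered once enough bytes have been logged, regardless of who wrote them. It is only a sketch of the idea and not JE's actual implementation; the class, enum and method names are hypothetical.

```java
/** Toy model of a byte-interval trigger that ignores the origin of each write. */
final class LogWriteTrigger {
    /** Who produced the log write; recorded by callers but not used for triggering. */
    enum Source { ONLINE, CLEANER, EVICTOR, CHECKPOINTER }

    private final long bytesInterval;
    private long bytesSinceLastTrigger;

    LogWriteTrigger(long bytesInterval) {
        if (bytesInterval <= 0) {
            throw new IllegalArgumentException("bytesInterval must be positive");
        }
        this.bytesInterval = bytesInterval;
    }

    /**
     * Records a log write and reports whether the interval has been reached.
     * The counter resets to zero whenever the trigger fires.
     */
    boolean recordWrite(Source source, long bytes) {
        // The source is deliberately ignored: cleaner migration, eviction and
        // checkpoint writes advance the counter exactly like online writes.
        bytesSinceLastTrigger += bytes;
        if (bytesSinceLastTrigger >= bytesInterval) {
            bytesSinceLastTrigger = 0;
            return true;
        }
        return false;
    }
}
```

Under this model, heavy cleaner migration shortens the time between triggers even when online traffic is constant, which is why background write volume matters when sizing D.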
            • 3. Re: Understanding the correlation between cache size and cleaner performance
              vinothchandar
              Hi Mark,
              Dirty BINs are written (made non-dirty) by checkpoints. So you'd want to estimate the number of BINs that are dirtied in between checkpoints.
              Alright. If I give enough memory such that D can hold all dirty BINs during a cleaning cycle, things do well. Thanks for confirming.

              About question 2, what I am basically asking is: would very frequent checkpoints hurt by writing out BINs/INs much more frequently? For example, with enough memory and a long enough checkpointer interval, it seems to me that some of the BIN/IN dirtying by the cleaner could be amortized.

              Thanks
              Vinoth
              • 4. Re: Understanding the correlation between cache size and cleaner performance
                greybird
                About question 2, what I am basically asking is: would very frequent checkpoints hurt by writing out BINs/INs much more frequently?
                Yes, the more writing, the lower the performance.
                For example, with enough memory and a long enough checkpointer interval, it seems to me that some of the BIN/IN dirtying by the cleaner could be amortized.
                Yes, the checkpoint interval should be as long as possible, but not so long that recovery time will be unacceptable (as required for your app). I suspect your next question will be: How long does recovery take per MB of log? I don't know the answer and it depends on your app, so you'll have to experiment.

                --mark
                • 5. Re: Understanding the correlation between cache size and cleaner performance
                  vinothchandar
                  I suspect your next question will be: How long does recovery take per MB of log? I don't know the answer and it depends on your app, so you'll have to experiment.
                  Not really.. :) .. For now, I am totally content with coming up with a model that gives predictable, stable cleaner performance.

                  [Skip the stuff below if you don't want the details]
                  To give you more context, this relates to our multi-tenant deployments at LinkedIn: multiple DBs/envs on a single server. The BDB storage rewrite got rid of most of our scanning/cache pollution/duplicate woes, but exposed the next bottleneck. Right now, we have all these DBs sharing the cache in an ad hoc fashion (sharedCache=true).

                  On a particular cluster with a lot of DBs per box, we ran into an issue of cleaners simply doing a lot of IOPS and thus causing very frequent young gen collections.
                  This was impacting our 99th percentile latencies (not that the sky fell over).

                  I am attempting to come up with a model such that we can run a script and generate a cache size for each DB.
                  So I had to dig into all the implementation details. I think I have a fair handle on the internals now. Thanks for all the help/confirmations.

                  Thanks
                  Vinoth